WL#3529: Unicode Escape Sequences

Affects: Server-7.1 — Status: Un-Assigned

Description
High Level Architecture

MySQL will allow Unicode delimited identifiers e.g. U&"ID".
MySQL will allow Unicode character string literals e.g. U&'c'.
Inside Unicode delimited identifiers or Unicode
character string literals, MySQL will allow Unicode
escape sequences e.g. U&"***\30DA***", U&'\0000'.

Unicode delimited identifiers
-----------------------------

Syntax:

U ampersand double-quote Unicode-delimited-identifier-body double-quote

The letter U is upper case, u is not allowed.
There can be no spaces between U and ampersand and double-quote.

MySQL allows ` instead of double-quote for identifiers,
but U&`identifier` is illegal.

A Unicode delimited identifier may appear in SQL statements
wherever an identifier is acceptable.

Unicode delimited identifiers should be subject to the same
rules as other delimited identifiers for case sensitivity, see
WL#922. But until WL#922 is done, these are all the same thing:
Identifier, "Identifier" (sql_mode=ansi), `IDENTIFIER`, U&"Identifier".

Examples: U&"Unicode Identifier", U&"SELECT"

Unicode character string literals
---------------------------------

Syntax:

U ampersand single-quote Unicode-character-string-body single-quote

The letter U is upper case, u is not allowed.
There can be no spaces between U and ampersand and single-quote.

MySQL allows " instead of single-quote for literals,
but U&"character-string" is illegal, MySQL will interpret that
as U&"identifier".

A Unicode character string literal may appear in SQL statements
wherever a character string is acceptable.

Introducers are okay. The introducer character set can be any
character set that MySQL supports. If there's no introducer,
the rules are the same as for ordinary character string literals,
and therefore the character set is character_set_connection.
Therefore there is nothing about a "Unicode character string
literal" which implies that the literal's character set is
Unicode.

There will be a conversion to '?' and a warning if the character is not in
the (introduced or default) character set.

Sometimes character strings are separated, for example:
mysql> select 'a' 'b';
+----+
| a  |
+----+
| ab |
+----+
1 row in set (0.00 sec)
With respect to Unicode character string literals, the syntax
U&'a' 'b' is okay, it means the same as U&'ab'. But the syntax
U&'a' U&'b' is illegal.

Examples: U&'Unicode character string', U&'!', _utf8 U&'xxxxxx'

Unicode escape sequences
------------------------

Syntax:
\hhhh
or
\+hhhhhh

Here 'h' stands for 'hexit', a hexadecimal digit, that is:
0123456789ABCDEF. The letters abcdef are okay too, case is
not significant. The assumption is that no client program
or driver will intercept the REVERSE SOLIDUS i.e. \ character.

Notice the absence of the letter 'u'. The original idea was
that Unicode escapes should appear as \unnnn but that seems
to be taken from specifications for other languages/drivers.

Unicode escape sequences may appear only inside Unicode
delimited identifiers or Unicode character string literals.
This might be controversial, since the original idea was
that such escape sequences could appear in any strings.
But if they appear in any strings, then what happens when
somebody has legacy code which happened to contain '\'?
Should such activity be controllable with yet another flag?

Since MySQL will support supplementary characters starting
with version 5.2 (WL#1213), there will be a need for the
\+hhhhhh syntax.

Suppose the value is \30DA. The number 30DA does not mean
encoding is 30DA, it means "code point" is 30DA. Thus,
even if the character set is SJIS, the escape sequence \30DA
means KATAKANA LETTER PE because the Unicode page says so:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
and it doesn't matter that SJIS uses a different number.
Of course the character must be convertible to SJIS. So
_sjis U&'\30DA' = _sjis 'ペ' is true, and
_sjis U&'\30DA' = _sjis 0x8379 is true.

All the older MySQL-style escape sequences described in
the MySQL Reference Manual section 9.1.1. Strings,
http://dev.mysql.com/doc/refman/5.0/en/string-syntax.html
\0 \' \" \b \n \r \t \Z \\ \% \_
have no meaning within a Unicode character string literal.

What if \ is followed by something other than + or hexits,
for example U&'\555' or U&"A\Z"? Those are invalid escape
sequences, but in the spirit of MySQL, accept them --
the \ has no special meaning if the + or hexits are absent.

So how to store '\'? MySQL will allow UESCAPE (see
later section). MySQL won't accept \\ (see earlier paragraph).
The reliable way is with the escape for \. That is:
U&'\005C'               is interpreted as \

Examples:
\00F6 \+0000F6
U&"AAAAA\00F6BBBBB"

UESCAPE
-------

Syntax:
UESCAPE 'escape-character'

The 'escape-character' is a single character,
default '\' (reverse solidus), which may not
be any of {0123456789ABCDEF+"} or whitespace.

In the previous section the illustration was
U&'\30DA' using the default \ escape character.
If one wished to use ^ instead, one could say:
U^'^30DA' UESCAPE '^'
Perhaps this is needed if NO_BACKSLASH_ESCAPES is on.

Using default escape character '\' is usually
good enough, but not for sjis, big5, gbk, cp932.

UESCAPE is a new reserved word.

Quotes within quotes
--------------------

Inside ordinary delimited identifiers and literals,
" or ' may be doubled to indicate that they occur once,
for example 'AB''CD' means AB'CD. The same thing is true
for Unicode delimited identifiers and literals.

Two occurrences of \0022 (i.e. QUOTATION MARK) are not to
be interpreted the same way as two occurrences of ' inside
a literal. That is:
U&'AB''CD'              is interpreted as AB'CD
U&'AB\0022\0022CD'      is interpreted as AB''CD

Nifty non-standard ideas
------------------------

One could say that a Unicode escape sequence is legal
anywhere in a statement, even outside identifiers or literals.
One could dream up a special escape sequence for NULL.

SHOW CREATE, mysqldump
----------------------
As soon as we support Unicode Escape Sequences,
we can modify "SHOW CREATE TABLE" and other 
commands returning CREATE scripts for various objects.

If some identifier, say table name, cannot be converted to the
current character_set_results, then automatic escaping takes place:

SET NAMES koi8r;
CREATE TABLE "this is a non-Latin1 character Ё" (a int)";
SHOW CREATE TABLE "this is a non-Latin1 character Ё" (a int)"
->  CREATE TABLE "this is a non-Latin1 character Ё" (a int) -- no
escapes

SHOW CREATE TABLE U&"this is a non-Latin1 character \0410" (a int)"
->  CREATE TABLE "this is a non-Latin1 character Ё" (a int) -- no
escapes

I.e. SHOW CREATE output does not use escape, no matter
whether SHOW CREATE query uses escapes.

but

SET NAMES latin1;
SHOW CREATE TABLE U&"this is a non-Latin1 character \0410"
->
CREATE TABLE U&"this is a non-Latin1 character \0410" -- with escape

uses escapes, because Ё cannot be converted to Latin1.

Note, if U&"..." can be displayed without escaping, then U& should not printed.

As side effect, mysqldump will be able to write
data into a dump file with any arbitrary --default-character-set
without data loss.


Compatibility
-------------

By accepting this, MySQL will approximately comply with
two SQL:2003 non-core features:
F392, “Unicode escapes in identifiers”,
F393, “Unicode escapes in literals”.

But the use of \ is not common outside SQL, \u is more common.

And other major DBMSs do not have these features, or do not
have them in the same way. Examples:

Escape sequences recognized by Oracle OLAP (*not* Oracle Database):
http://www.stanford.edu/dept/itss/docs/oracle/10g/olap.101/b10339/datatypes001.htm

Turning UnicodeEscape off in Derby:
http://db.apache.org/derby/manuals/tools/tools74.html#IDX339

Mimer assumes 8 digit escape sequences but only for load/unload:
http://developer.mimer.com/documentation/latest_html/Mimer_SQL_Engine_DocSet/Load4.html#wp1122548

IBM constants including U& strings:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.sql.ref.doc/doc/r0000731.html

Feature request:
BUG#10199 Allow Unicode escape sequence for string literals.