WL#3529: Unicode Escape Sequences
Affects: Server-7.1
—
Status: Un-Assigned
MySQL will allow Unicode delimited identifiers e.g. U&"ID". MySQL will allow Unicode character string literals e.g. U&'c'. Inside Unicode delimited identifiers or Unicode character string literals, MySQL will allow Unicode escape sequences e.g. U&"***\30DA***", U&'\0000'.
Unicode delimited identifiers ----------------------------- Syntax: U ampersand double-quote Unicode-delimited-identifier-body double-quote The letter U is upper case, u is not allowed. There can be no spaces between U and ampersand and double-quote. MySQL allows ` instead of double-quote for identifiers, but U&`identifier` is illegal. A Unicode delimited identifier may appear in SQL statements wherever an identifier is acceptable. Unicode delimited identifiers should be subject to the same rules as other delimited identifiers for case sensitivity, see WL#922. But until WL#922 is done, these are all the same thing: Identifier, "Identifier" (sql_mode=ansi), `IDENTIFIER`, U&"Identifier". Examples: U&"Unicode Identifier", U&"SELECT" Unicode character string literals --------------------------------- Syntax: U ampersand single-quote Unicode-character-string-body single-quote The letter U is upper case, u is not allowed. There can be no spaces between U and ampersand and single-quote. MySQL allows " instead of single-quote for literals, but U&"character-string" is illegal, MySQL will interpret that as U&"identifier". A Unicode character string literal may appear in SQL statements wherever a character string is acceptable. Introducers are okay. The introducer character set can be any character set that MySQL supports. If there's no introducer, the rules are the same as for ordinary character string literals, and therefore the character set is character_set_connection. Therefore there is nothing about a "Unicode character string literal" which implies that the literal's character set is Unicode. There will be a conversion to '?' and a warning if the character is not in the (introduced or default) character set. Sometimes character strings are separated, for example: mysql> select 'a' 'b'; +----+ | a | +----+ | ab | +----+ 1 row in set (0.00 sec) With respect to Unicode character string literals, the syntax U&'a' 'b' is okay, it means the same as U&'ab'. But the syntax U&'a' U&'b' is illegal. Examples: U&'Unicode character string', U&'!', _utf8 U&'xxxxxx' Unicode escape sequences ------------------------ Syntax: \hhhh or \+hhhhhh Here 'h' stands for 'hexit', a hexadecimal digit, that is: 0123456789ABCDEF. The letters abcdef are okay too, case is not significant. The assumption is that no client program or driver will intercept the REVERSE SOLIDUS i.e. \ character. Notice the absence of the letter 'u'. The original idea was that Unicode escapes should appear as \unnnn but that seems to be taken from specifications for other languages/drivers. Unicode escape sequences may appear only inside Unicode delimited identifiers or Unicode character string literals. This might be controversial, since the original idea was that such escape sequences could appear in any strings. But if they appear in any strings, then what happens when somebody has legacy code which happened to contain '\'? Should such activity be controllable with yet another flag? Since MySQL will support supplementary characters starting with version 5.2 (WL#1213), there will be a need for the \+hhhhhh syntax. Suppose the value is \30DA. The number 30DA does not mean encoding is 30DA, it means "code point" is 30DA. Thus, even if the character set is SJIS, the escape sequence \30DA means KATAKANA LETTER PE because the Unicode page says so: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and it doesn't matter that SJIS uses a different number. Of course the character must be convertible to SJIS. So _sjis U&'\30DA' = _sjis 'ペ' is true, and _sjis U&'\30DA' = _sjis 0x8379 is true. All the older MySQL-style escape sequences described in the MySQL Reference Manual section 9.1.1. Strings, http://dev.mysql.com/doc/refman/5.0/en/string-syntax.html \0 \' \" \b \n \r \t \Z \\ \% \_ have no meaning within a Unicode character string literal. What if \ is followed by something other than + or hexits, for example U&'\555' or U&"A\Z"? Those are invalid escape sequences, but in the spirit of MySQL, accept them -- the \ has no special meaning if the + or hexits are absent. So how to store '\'? MySQL will allow UESCAPE (see later section). MySQL won't accept \\ (see earlier paragraph). The reliable way is with the escape for \. That is: U&'\005C' is interpreted as \ Examples: \00F6 \+0000F6 U&"AAAAA\00F6BBBBB" UESCAPE ------- Syntax: UESCAPE 'escape-character' The 'escape-character' is a single character, default '\' (reverse solidus), which may not be any of {0123456789ABCDEF+"} or whitespace. In the previous section the illustration was U&'\30DA' using the default \ escape character. If one wished to use ^ instead, one could say: U^'^30DA' UESCAPE '^' Perhaps this is needed if NO_BACKSLASH_ESCAPES is on. Using default escape character '\' is usually good enough, but not for sjis, big5, gbk, cp932. UESCAPE is a new reserved word. Quotes within quotes -------------------- Inside ordinary delimited identifiers and literals, " or ' may be doubled to indicate that they occur once, for example 'AB''CD' means AB'CD. The same thing is true for Unicode delimited identifiers and literals. Two occurrences of \0022 (i.e. QUOTATION MARK) are not to be interpreted the same way as two occurrences of ' inside a literal. That is: U&'AB''CD' is interpreted as AB'CD U&'AB\0022\0022CD' is interpreted as AB''CD Nifty non-standard ideas ------------------------ One could say that a Unicode escape sequence is legal anywhere in a statement, even outside identifiers or literals. One could dream up a special escape sequence for NULL. SHOW CREATE, mysqldump ---------------------- As soon as we support Unicode Escape Sequences, we can modify "SHOW CREATE TABLE" and other commands returning CREATE scripts for various objects. If some identifier, say table name, cannot be converted to the current character_set_results, then automatic escaping takes place: SET NAMES koi8r; CREATE TABLE "this is a non-Latin1 character Ё" (a int)"; SHOW CREATE TABLE "this is a non-Latin1 character Ё" (a int)" -> CREATE TABLE "this is a non-Latin1 character Ё" (a int) -- no escapes SHOW CREATE TABLE U&"this is a non-Latin1 character \0410" (a int)" -> CREATE TABLE "this is a non-Latin1 character Ё" (a int) -- no escapes I.e. SHOW CREATE output does not use escape, no matter whether SHOW CREATE query uses escapes. but SET NAMES latin1; SHOW CREATE TABLE U&"this is a non-Latin1 character \0410" -> CREATE TABLE U&"this is a non-Latin1 character \0410" -- with escape uses escapes, because Ё cannot be converted to Latin1. Note, if U&"..." can be displayed without escaping, then U& should not printed. As side effect, mysqldump will be able to write data into a dump file with any arbitrary --default-character-set without data loss. Compatibility ------------- By accepting this, MySQL will approximately comply with two SQL:2003 non-core features: F392, “Unicode escapes in identifiers”, F393, “Unicode escapes in literals”. But the use of \ is not common outside SQL, \u is more common. And other major DBMSs do not have these features, or do not have them in the same way. Examples: Escape sequences recognized by Oracle OLAP (*not* Oracle Database): http://www.stanford.edu/dept/itss/docs/oracle/10g/olap.101/b10339/datatypes001.htm Turning UnicodeEscape off in Derby: http://db.apache.org/derby/manuals/tools/tools74.html#IDX339 Mimer assumes 8 digit escape sequences but only for load/unload: http://developer.mimer.com/documentation/latest_html/Mimer_SQL_Engine_DocSet/Load4.html#wp1122548 IBM constants including U& strings: http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=/com.ibm.db2.luw.sql.ref.doc/doc/r0000731.html Feature request: BUG#10199 Allow Unicode escape sequence for string literals.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.