WL#751: Error message construction

Affects: Server-5.5 — Status: Complete

Description
High Level Architecture

This task is a part for Sun Globalization Requirements,
category "Externalized messages and Message construction".

There will be a new way to produce error messages, taking into account
the user's character set and language preferences.

This affects errmsg.sys, errmsg.txt, internals documentation,
string formatting, and @@character_set_results.

The current way of producing error messages does not allow to choose
a language per user, and has a number of problems with character sets
not being properly converted. See the section "References" for details.


New error messages source file: errmsg-utf8.txt
===============================================
We will cease to use the old errmsg.txt file in MySQL 6.1,
it is hard to maintain because it is a mixture of
different character sets, it's hard to edit,
and it gets broken from time to time. For example:
BUG#39949 errmsg.txt in 6.0 is corrupted.

Instead, we'll use errmsg-utf8.txt, which is a pure utf8
version of errmsg.txt. 

At time of writing, errmsg-utf8.txt is in sql/share in a
tree that will be foundational for the Azalea release.

New error message source file and comp_err
==========================================
The compiled error message files at sql/share/[language-name]/errmsg.sys will 
store error messages in utf8 character set.
That means no changes are needed in comp_err sources:
comp_err will just take the errmsg-utf8.txt source file and write
to the errmsg.sys compiled file without any conversion.

Internals documentation
=======================
As well as updates to the MySQL Reference Manual, the Documentation Person
for this task has volunteered to change the Internals Manual
about forming error messages, thus affecting
http://forge.mysql.com/wiki/MySQL_Internals_Error_Messages 
Peter speculates that some people actually depend on the section about
error messages. 

Construction procedure
======================
Whenever an error message is constructed, all character string arguments
will be converted to the UTF8 character set. So we will mix UTF8 message
patterns with UTF8 message parameters and get a pure UTF8 result.

Identifiers (like table, database names, etc) will be copied
AS IS (as they are stored in UTF8 internally).

Character string constants will be converted from their character
set to UTF8.

Binary string constants will be substituted as follows:
- Bytes in the range 0x20-0x7E will be substituted AS IS
- Bytes in the ranges 0x00-0x1F and 0x7F-0xFF will use hex encoding
  REVERSE SOLIDUS + 'x' + hex(high part of byte) + hex(low part of byte)
For example, for a failure inserting 0x41c39f in a VARBINARY unique column,
the constructed error message might be a UTF8 string containing
"
Duplicate entry 'A\xC3\x9F' for key 1
"

Since '\' REVERSE SOLIDUS itself is in the range 0x20-0x7e,
it does not become '\x5C'. Alexander Barkov proposed that '\'
can be printed as '\\', we'll leave that up to the implementor.

Sending procedure
=================
When a complete message is sent to the client, it will be converted
from UTF8 to the character set used for result sets and messages
sent from server to client, namely, to @@character_set_results.

We considered adding a new separate variable, @@character_set_messages,
for messages as opposed to result sets. There was no feature request
for such a variable, so the existing variable @@character_set_results
is fine. The @@character_set_messages proposal is dead.

When @@character_set_results is NULL or 'BINARY' or 'UTF8',
no conversion will happen and the messages will be sent to the client
AS IS in the character set they are constructed in, namely, in UTF8.

When @@character_set_results is one of the "real-multibyte" character
sets (UCS2, UTF16, UTF32), we will make sure that error messages work
fine by adding appropriate tests.

[ 2009-07-03 A suggestion about "optimization" has been removed with
consent of Supervisor and Architecture Reviewer. See progress notes. ]

When @@character_set_results is not one of the above, or when one of
the above conditions is not true, then messages are not sent AS IS,
some or all characters in the message must be converted to
@@character_set_results. If during conversion some character can't be
converted to @@character_set_results, then it will be replaced
with REVERSE SOLIDUS + hex Unicode code point value as in WL#3529
"Unicode escape sequences":
- for characters from the BMP range (0000-FFFF) using \1234 notation
- for characters outside BMP (010000-10FFFF) using \+123456 notation.
For example if one tries to drop a table named ペ (KATAKANA LETTER PE
which is U30DA), and @@character_set_results = 'latin1', the message is
ERROR 1051 (42S02): Unknown table '\30DA'
and this is an apparent behaviour change. Three things to remember are:
(a) Unicode escape sequences are for characters. For Binary, see above.
(b) There is no conversion of '\' itself.
(c) This is only for error messages; ordinarily conversions convert
    unconvertible characters to '?'s.


Some examples of escaping
=========================

Here are some examples showing what the error message looks
like if escaping may happen (because the string is binary
or because a character is unconvertible).

1. Binary string
CREATE TABLE t1 (a varbinary(2));
INSERT INTO t1 VALUES (0xC39F);
INSERT INTO t1 VALUES (0xC39F);
ERROR 1062 (23000): Duplicate entry '\xC3\x9F' for key 1

2. Character string
SET NAMES utf8;
CREATE TABLE t1 (a varchar(1) character set utf8);
INSERT INTO t1 VALUES (0xC39F);
INSERT INTO t1 VALUES (0xC39F);
ERROR 1062 (23000): Duplicate entry 'ß' for key 1

3. Character string with un-convertible characters
SET NAMES cp1251; /* Cyrillic charset, does not suppot SZ */
CREATE TABLE t1 (a varchar(2) character set utf8);
INSERT INTO t1 VALUES (0xC39F);
INSERT INTO t1 VALUES (0xC39F);
ERROR 1062 (23000): Duplicate entry '\00DF' for key 1

4. Peter wants the following highlighted in some documentation
somewhere, that there's an apparent behaviour change.

Here's what he sees on his terminal: (ペ is KATAKANA LETTER PE):
"
mysql> SET NAMES UTF8;
Query OK, 0 rows affected (0.00 sec)

mysql> PREPARE stmt1 FROM 'drop table ペ';
Query OK, 0 rows affected (0.01 sec)
Statement prepared

mysql> SET @@character_set_results = latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> EXECUTE stmt1;
ERROR 1051 (42S02): Unknown table 'ペ'
"

But with the new rule he'll see
ERROR 1051 (42S02): Unknown table '\30DA' 


New session variable @@lc_messages
==================================
A new session variable @@lc_messages will be added,
to specify which language error messages will be in
when they are constructed (before they are sent to the client).
The notation is similar to what we have for @@lc_time_names. For example:

SET @@lc_messages='de_DE'; /* compare SET @@lc_time_names='de_DE' */

The default value will be GLOBAL lc_messages, which is initially
'en_US' (see below "New global variable @@lc_messages").

We considered allowing common language names, for example
SET @@lc_messages='german'; /* compare message file name 'german.sys' */
but the decision was: we won't allow common language names.

See the mapping between language and locale 
notations in the section "The list of languages" below.

Changing SESSION.lc_messages will be possible on the fly. For example:
SET lc_messages='de_DE';

SET lc_messages='en_EN';


Attempts to set to invalid values will cause error 1231. For example:
mysql> SET @@lc_messages=null;
ERROR 1231 (42000): Variable 'lc_messages' can't be set to the value of 'NULL'

The variables @@character_set_results and @@lc_messages are independent.
If you change one, there is no automatic effect on the other.

The setting of @@lc_messages does not affect SIGNAL or RESIGNAL.
It should be clear that after
SET @@lc_messages = 'de_DE';
SIGNAL SQLSTATE '11111' SET MYSQL_ERRNO=1051, MESSAGE_TEXT = 'X';
there will be no attempt to use the German text for error 1051.

The setting of @@lc_messages affects messages at the time they are produced.
Changing @@lc_messages after the message is produced, but before operations
which read the messages such as SHOW WARNINGS or the future GET DIAGNOSTICS,
does not cause a change of the messages.

We do not expect that changes to LC_MESSAGES may affect statement replication.

New global variable @@lc_messages
=================================
A new global variable @@lc_messages will be added,
to specify the default language for error messages.
Changing of the global variable will be possible on the fly,
with SUPER privilege required.

After GLOBAL.lc_messages is changed, all new connecting
clients will initialize their SESSION.lc_messages to
the GLOBAL.lc_messages value. Existing clients won't
be affected by changes in GLOBAL.lc_messages.


The list of languages
=====================
In version 6.1 the possible settings for lc_messages
will include at least the languages we currently have error messages for:

cs_CZ = czech
da_DK = danish
nl_NL = dutch
en_US = english
et_EE = estonian
fr_FR = french
de_DE = german
el_GR = greek
hu_HU = hungarian
it_IT = italian
ja_JP = japanese
ko_KR = korean
nb_NO or no_NO = norwegian (Bokmål)
nn_NO = norwegian-ny (Nynorsk)
pl_PL = polish
pt_PT = portuguese
ro_RO = romanian
ru_RU = russian
sr_YU = serbian
sk_SK = slovak
es_ES = spanish
sv_SE = swedish
uk_UA = ukrainian

[ 2009-08-06 A requirement for sr_CS has been removed
with consent of Supervisor and Architecture Reviewer.
See dev-private emails "Re: WL#751 Error message construction". ]

[ 2009-08-10 BUG#46633 Obsolete Serbian locale name proposes sr_RS (Serbian).
Therefore the implementor may consider 'sr_RS' to be part of the above list. ]

"japanese-sjis" will be removed, as it is just
a repeat of "japanese" but in another character set.

Also, if we're lucky enough to find volunteers,
extra languages mentioned in these worklogs will be added:

WL#4617 Translate error messages to Tier1 languages
WL#4649 Translate error messages into Tier 2 languages


New my.cnf section and client command line parameter:
=====================================================

[ 2009-07-03 A suggestion about "mysql --lc_messages='de_DE'" has been
removed with consent of Supervisor and Architecture Reviewer.
See progress notes.
Initially @@session.lc_messages is the same as @@global.lc_messages. ]

New my.cnf section and server command line
==========================================
It will be possible to set lc_messages in 
server command line:

mysqld --lc-messages='de_DE';

and in server my.cnf or equivalent configuration file:

[mysqld]
lc_messages='de_DE'


The old variable 'language'
===========================
We won't remove the old variable 'language'.
It will display the directory name of errmsg.sys,
as usual.


Messages and stored procedures
==============================
There are a few possible choices which language stored programs
(functions, stored procedures, triggers) use
when raising an error text:

- server
- user language at the time the routine is created
- user language at runtime, namely, @@lc_messages

We'll use the "user language at runtime" choice.

This may affect GET DIAGNOSTICS and SHOW WARNINGS and SHOW ERRORS:
The language conversion must happen at the time the error is raised.
That is because "SIGNAL ... SET MESSAGE_TEXT = 'x';" must remain 'x'.
Therefore, if you say
SET @@lc_messages='en_US';
DROP TABLE nonexistent_table; /* Causing an error message in English */
SET @@lc_messages='de_DE';
SHOW ERRORS;
The SHOW ERRORS statement will show the message in English, not German.

New formatting
==============
We'll modify my_vsnprintf() to provide some new formatting facilities:

- Positional arguments ... a digit and a dollar sign.
In different languages, arguments may be in different relative positions.
Positional arguments allow the error-message writer to avoid the
awkward language or unnecessary long messages which result
from having the arguments in the same order in all languages.
The position must be a single digit from '0' to '9'.
EXAMPLE: my_vsnprintf("%1$s %2$s", "one", "two")  -> "one two"
EXAMPLE: my_vsnprintf("%2$s %1$s", "one", "two")  -> "two one"
If one argument has a position, then all arguments must have a position.
Duplicates, for example "%1$s %2$s %1$s" (with two '1's), are legal,
the result will be that the '1' argument goes out twice.
Gaps, for example "%1$s %3$s" (without any '2') are illegal.

- Identifier escaping ... a grave accent, also known as a backtick.
The current version (a) doesn't follow sql_mode, (b) escapes incorrectly.
Instead of
  "Table `%s`.`%s` not found"
we will write
  "Table %`s.%`s not found" 
That is, specifying backtick after '%' and before 's'
will force escaping of the parameter to print a good identifier.
Escaping will be done only when necessary, for example, when
the identifier contains spaces.

- Binary constants ... a hash mark, or two hash marks.
Already now we have quite a few error messages that do the following:
  {
    char bug[256];
    escape_for_error(buf, sizeof(buf), binary_data, binary_data_len);
    my_error(ER_SOME_ERROR, buf);
  }
A new '#' modifier will change the '%s' behavior as follows:
%#s - print string as is, escaping only bytes outside the range 0x20-0x7E
Escaping happens the same way as with escaping of "Binary string constants":
REVERSE SOLIDUS + 'x' + hex(high part of byte) + hex(low part of byte).
EXAMPLE: my_vsnprintf("%#s", "A.B")  -> "A\x2EB"
[ 2009-07-03 A suggestion about "##s" has been removed with
consent of Supervisor and Architecture Reviewer. See progress notes. ]

Astute readers will have noticed that we describe "escaping" for section
"Construction procedure", for section "Sending procedure", and for this
section "New formatting" -- with different rules each time. We will make
the optimistic assumption that the rules can't conflict, or be applied
in sequence.

Upgrading
=========

Suppose I'm a user and I made my own changes to 6.0 errmsg.txt.
What upgrading tool shall I use to change my errmsg.txt to errmsg-utf8.txt?

Answer: there is no tool. Sergei Golubchik proposed to add some
automation in the Makefile, but we don't know about that yet.

The tentative decision is:
People who changed errmsg.txt are "on their own"
and we are not responsible for changing errmsg.txt appropriately.


Miscellaneous
=============
We will make sure that 
"BUG#1406 Tablename in Errormessage not in default characterset"
is fixed by this WL by adding an appropriate test.

We will leave it up to the implementor to decide whether to convert SQLSTATE
as well as MESSAGE_TEXT.

We will leave it up to the implementor to check whether
FEDERATED supports "SET NAMES" and "SET lc_time_names".
If so, then lc_messages should be supported as well.


References:
===========
Feature request at BUG#7722 "language should be a dynamic global server variable"
Feature request at BUG#39926 "Language should belong to client instead of server"
Feature request at BUG#28139 "client in user language"

BUG#1406 Tablename in Errormessage not in default characterset