WL#751: Error message construction
Affects: Server-5.5
—
Status: Complete
This task is part of the Sun Globalization Requirements, category "Externalized messages and Message construction". There will be a new way to produce error messages, taking into account the user's character set and language preferences. This affects errmsg.sys, errmsg.txt, internals documentation, string formatting, and @@character_set_results.
The current way of producing error messages does not allow choosing a language per user, and has a number of problems with character sets not being properly converted. See the section "References" for details.

New error messages source file: errmsg-utf8.txt
===============================================

We will cease to use the old errmsg.txt file in MySQL 6.1. It is hard to maintain because it is a mixture of different character sets, it is hard to edit, and it gets broken from time to time. For example: BUG#39949 errmsg.txt in 6.0 is corrupted. Instead, we'll use errmsg-utf8.txt, which is a pure utf8 version of errmsg.txt. At time of writing, errmsg-utf8.txt is in sql/share in a tree that will be foundational for the Azalea release.

New error message source file and comp_err
==========================================

The compiled error message files at sql/share/[language-name]/errmsg.sys will store error messages in the utf8 character set. That means no changes are needed in comp_err sources: comp_err will just take the errmsg-utf8.txt source file and write to the errmsg.sys compiled file without any conversion.

Internals documentation
=======================

As well as updates to the MySQL Reference Manual, the Documentation Person for this task has volunteered to change the Internals Manual about forming error messages, thus affecting
http://forge.mysql.com/wiki/MySQL_Internals_Error_Messages
Peter speculates that some people actually depend on the section about error messages.

Construction procedure
======================

Whenever an error message is constructed, all character string arguments will be converted to the UTF8 character set. So we will mix UTF8 message patterns with UTF8 message parameters and get a pure UTF8 result.

Identifiers (like table, database names, etc) will be copied AS IS (as they are stored in UTF8 internally).

Character string constants will be converted from their character set to UTF8.
Binary string constants will be substituted as follows:
- Bytes in the range 0x20-0x7E will be substituted AS IS
- Bytes in the ranges 0x00-0x1F and 0x7F-0xFF will use hex encoding
  REVERSE SOLIDUS + 'x' + hex(high part of byte) + hex(low part of byte)

For example, for a failure inserting 0x41c39f in a VARBINARY unique column, the constructed error message might be a UTF8 string containing
  " Duplicate entry 'A\xC3\x9F' for key 1 "

Since '\' REVERSE SOLIDUS itself is in the range 0x20-0x7e, it does not become '\x5C'. Alexander Barkov proposed that '\' can be printed as '\\'; we'll leave that up to the implementor.

Sending procedure
=================

When a complete message is sent to the client, it will be converted from UTF8 to the character set used for result sets and messages sent from server to client, namely, to @@character_set_results.

We considered adding a new separate variable, @@character_set_messages, for messages as opposed to result sets. There was no feature request for such a variable, so the existing variable @@character_set_results is fine. The @@character_set_messages proposal is dead.

When @@character_set_results is NULL or 'BINARY' or 'UTF8', no conversion will happen and the messages will be sent to the client AS IS in the character set they are constructed in, namely, in UTF8.

When @@character_set_results is one of the "real-multibyte" character sets (UCS2, UTF16, UTF32), we will make sure that error messages work fine by adding appropriate tests.

[ 2009-07-03 A suggestion about "optimization" has been removed with consent of Supervisor and Architecture Reviewer. See progress notes. ]

When @@character_set_results is not one of the above, or when one of the above conditions is not true, then messages are not sent AS IS: some or all characters in the message must be converted to @@character_set_results.
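The no-conversion cases above can be sketched as a small predicate. This is only an illustration under stated assumptions: the names ascii_iequals() and message_needs_conversion() are hypothetical and are not part of the server, a NULL pointer stands in for @@character_set_results = NULL, and character set names are assumed to compare case-insensitively.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Case-insensitive comparison of two ASCII names (hypothetical helper). */
static bool ascii_iequals(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);
  while (*a && *b)
  {
    if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
      return false;
    a++;
    b++;
  }
  return *a == *b;  /* equal only if both strings ended together */
}

/*
  Hypothetical predicate: returns true when a UTF8 message must be
  converted before it is sent to the client.
  cs_results == NULL models @@character_set_results = NULL.
*/
static bool message_needs_conversion(const char *cs_results)
{
  if (cs_results == NULL)
    return false;                      /* NULL: send AS IS */
  if (ascii_iequals(cs_results, "binary") ||
      ascii_iequals(cs_results, "utf8"))
    return false;                      /* BINARY or UTF8: send AS IS */
  return true;                         /* everything else is converted */
}
```

Under these assumptions, 'latin1' and the "real-multibyte" sets such as 'ucs2' all take the conversion path, while NULL, 'BINARY' and 'UTF8' do not.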
If during conversion some character can't be converted to @@character_set_results, then it will be replaced with REVERSE SOLIDUS + hex Unicode code point value as in WL#3529 "Unicode escape sequences":
- for characters from the BMP range (0000-FFFF) using \1234 notation
- for characters outside BMP (010000-10FFFF) using \+123456 notation.

For example if one tries to drop a table named ペ (KATAKANA LETTER PE which is U+30DA), and @@character_set_results = 'latin1', the message is
  ERROR 1051 (42S02): Unknown table '\30DA'
and this is an apparent behaviour change.

Three things to remember are:
(a) Unicode escape sequences are for characters. For Binary, see above.
(b) There is no conversion of '\' itself.
(c) This is only for error messages; ordinarily conversions convert unconvertible characters to '?'s.

Some examples of escaping
=========================

Here are some examples showing what the error message looks like if escaping may happen (because the string is binary or because a character is unconvertible).

1. Binary string

  CREATE TABLE t1 (a varbinary(2) UNIQUE);
  INSERT INTO t1 VALUES (0xC39F);
  INSERT INTO t1 VALUES (0xC39F);
  ERROR 1062 (23000): Duplicate entry '\xC3\x9F' for key 1

2. Character string

  SET NAMES utf8;
  CREATE TABLE t1 (a varchar(1) character set utf8 UNIQUE);
  INSERT INTO t1 VALUES (0xC39F);
  INSERT INTO t1 VALUES (0xC39F);
  ERROR 1062 (23000): Duplicate entry 'ß' for key 1

3. Character string with un-convertible characters

  SET NAMES cp1251; /* Cyrillic charset, does not support SZ */
  CREATE TABLE t1 (a varchar(2) character set utf8 UNIQUE);
  INSERT INTO t1 VALUES (0xC39F);
  INSERT INTO t1 VALUES (0xC39F);
  ERROR 1062 (23000): Duplicate entry '\00DF' for key 1

4. Peter wants the following highlighted in some documentation somewhere, that there's an apparent behaviour change.
Here's what he sees on his terminal (ペ is KATAKANA LETTER PE):
"
  mysql> SET NAMES UTF8;
  Query OK, 0 rows affected (0.00 sec)

  mysql> PREPARE stmt1 FROM 'drop table ペ';
  Query OK, 0 rows affected (0.01 sec)
  Statement prepared

  mysql> SET @@character_set_results = latin1;
  Query OK, 0 rows affected (0.00 sec)

  mysql> EXECUTE stmt1;
  ERROR 1051 (42S02): Unknown table 'ペ'
"
But with the new rule he'll see
  ERROR 1051 (42S02): Unknown table '\30DA'

New session variable @@lc_messages
==================================

A new session variable @@lc_messages will be added, to specify which language error messages will be in when they are constructed (before they are sent to the client). The notation is similar to what we have for @@lc_time_names. For example:
  SET @@lc_messages='de_DE'; /* compare SET @@lc_time_names='de_DE' */
The default value will be GLOBAL lc_messages, which is initially 'en_US' (see below "New global variable @@lc_messages").

We considered allowing common language names, for example
  SET @@lc_messages='german'; /* compare message file name 'german.sys' */
but the decision was: we won't allow common language names. See the mapping between language and locale notations in the section "The list of languages" below.

Changing SESSION.lc_messages will be possible on the fly. For example:
  SET lc_messages='de_DE';
  SET lc_messages='en_US';

Attempts to set to invalid values will cause error 1231. For example:
  mysql> SET @@lc_messages=null;
  ERROR 1231 (42000): Variable 'lc_messages' can't be set to the value of 'NULL'

The variables @@character_set_results and @@lc_messages are independent. If you change one, there is no automatic effect on the other.

The setting of @@lc_messages does not affect SIGNAL or RESIGNAL. It should be clear that after
  SET @@lc_messages = 'de_DE';
  SIGNAL SQLSTATE '11111' SET MYSQL_ERRNO=1051, MESSAGE_TEXT = 'X';
there will be no attempt to use the German text for error 1051.
The setting of @@lc_messages affects messages at the time they are produced. Changing @@lc_messages after the message is produced, but before operations which read the messages such as SHOW WARNINGS or the future GET DIAGNOSTICS, does not cause a change of the messages.

We do not expect that changes to LC_MESSAGES may affect statement replication.

New global variable @@lc_messages
=================================

A new global variable @@lc_messages will be added, to specify the default language for error messages. Changing of the global variable will be possible on the fly, with SUPER privilege required.

After GLOBAL.lc_messages is changed, all new connecting clients will initialize their SESSION.lc_messages to the GLOBAL.lc_messages value. Existing clients won't be affected by changes in GLOBAL.lc_messages.

The list of languages
=====================

In version 6.1 the possible settings for lc_messages will include at least the languages we currently have error messages for:
  cs_CZ = czech
  da_DK = danish
  nl_NL = dutch
  en_US = english
  et_EE = estonian
  fr_FR = french
  de_DE = german
  el_GR = greek
  hu_HU = hungarian
  it_IT = italian
  ja_JP = japanese
  ko_KR = korean
  nb_NO or no_NO = norwegian (Bokmål)
  nn_NO = norwegian-ny (Nynorsk)
  pl_PL = polish
  pt_PT = portuguese
  ro_RO = romanian
  ru_RU = russian
  sr_YU = serbian
  sk_SK = slovak
  es_ES = spanish
  sv_SE = swedish
  uk_UA = ukrainian

[ 2009-08-06 A requirement for sr_CS has been removed with consent of Supervisor and Architecture Reviewer. See dev-private emails "Re: WL#751 Error message construction". ]

[ 2009-08-10 BUG#46633 Obsolete Serbian locale name proposes sr_RS (Serbian). Therefore the implementor may consider 'sr_RS' to be part of the above list. ]

"japanese-sjis" will be removed, as it is just a repeat of "japanese" but in another character set.
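The locale-to-language mapping listed above could be held in a small lookup table. The following is a minimal sketch, not the server's implementation: the struct, the table name and language_for_locale() are hypothetical, matching is assumed to be exact and case-sensitive, and 'sr_RS' is left out because the note above leaves it to the implementor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the mapping: locale name and the language name listed above. */
struct locale_language
{
  const char *locale;    /* value given to lc_messages, e.g. "de_DE" */
  const char *language;  /* language name, e.g. "german" */
};

static const struct locale_language locale_languages[] = {
  {"cs_CZ", "czech"},      {"da_DK", "danish"},     {"nl_NL", "dutch"},
  {"en_US", "english"},    {"et_EE", "estonian"},   {"fr_FR", "french"},
  {"de_DE", "german"},     {"el_GR", "greek"},      {"hu_HU", "hungarian"},
  {"it_IT", "italian"},    {"ja_JP", "japanese"},   {"ko_KR", "korean"},
  {"nb_NO", "norwegian"},  {"no_NO", "norwegian"},  {"nn_NO", "norwegian-ny"},
  {"pl_PL", "polish"},     {"pt_PT", "portuguese"}, {"ro_RO", "romanian"},
  {"ru_RU", "russian"},    {"sr_YU", "serbian"},    {"sk_SK", "slovak"},
  {"es_ES", "spanish"},    {"sv_SE", "swedish"},    {"uk_UA", "ukrainian"}
};

/*
  Hypothetical lookup: returns the language name for a locale, or NULL when
  the locale is not in the table (the caller would then report error 1231).
*/
static const char *language_for_locale(const char *locale)
{
  size_t i;
  assert(locale != NULL);
  for (i = 0; i < sizeof(locale_languages) / sizeof(locale_languages[0]); i++)
  {
    if (strcmp(locale_languages[i].locale, locale) == 0)
      return locale_languages[i].language;
  }
  return NULL;
}
```

With this table, language_for_locale("de_DE") yields "german", while a common language name such as "german" is not accepted as input and yields NULL, in line with the decision not to allow common language names.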
Also, if we're lucky enough to find volunteers, extra languages mentioned in these worklogs will be added:
  WL#4617 Translate error messages to Tier1 languages
  WL#4649 Translate error messages into Tier 2 languages

New my.cnf section and client command line parameter:
=====================================================

[ 2009-07-03 A suggestion about "mysql --lc_messages='de_DE'" has been removed with consent of Supervisor and Architecture Reviewer. See progress notes. Initially @@session.lc_messages is the same as @@global.lc_messages. ]

New my.cnf section and server command line
==========================================

It will be possible to set lc_messages in server command line:
  mysqld --lc-messages='de_DE';
and in server my.cnf or equivalent configuration file:
  [mysqld]
  lc_messages='de_DE'

The old variable 'language'
===========================

We won't remove the old variable 'language'. It will display the directory name of errmsg.sys, as usual.

Messages and stored procedures
==============================

There are a few possible choices which language stored programs (functions, stored procedures, triggers) use when raising an error text:
- server
- user language at the time the routine is created
- user language at runtime, namely, @@lc_messages

We'll use the "user language at runtime" choice.

This may affect GET DIAGNOSTICS and SHOW WARNINGS and SHOW ERRORS: The language conversion must happen at the time the error is raised. That is because "SIGNAL ... SET MESSAGE_TEXT = 'x';" must remain 'x'. Therefore, if you say
  SET @@lc_messages='en_US';
  DROP TABLE nonexistent_table; /* Causing an error message in English */
  SET @@lc_messages='de_DE';
  SHOW ERRORS;
The SHOW ERRORS statement will show the message in English, not German.

New formatting
==============

We'll modify my_vsnprintf() to provide some new formatting facilities:

- Positional arguments ... a digit and a dollar sign.
  In different languages, arguments may be in different relative positions.
  Positional arguments allow the error-message writer to avoid the awkward language or unnecessarily long messages which result from having the arguments in the same order in all languages. The position must be a single digit from '0' to '9'.
  EXAMPLE: my_vsnprintf("%1$s %2$s", "one", "two") -> "one two"
  EXAMPLE: my_vsnprintf("%2$s %1$s", "one", "two") -> "two one"
  If one argument has a position, then all arguments must have a position. Duplicates, for example "%1$s %2$s %1$s" (with two '1's), are legal; the result will be that the '1' argument goes out twice. Gaps, for example "%1$s %3$s" (without any '2'), are illegal.

- Identifier escaping ... a grave accent, also known as a backtick.
  The current version (a) doesn't follow sql_mode, (b) escapes incorrectly. Instead of
    "Table `%s`.`%s` not found"
  we will write
    "Table %`s.%`s not found"
  That is, specifying backtick after '%' and before 's' will force escaping of the parameter to print a good identifier. Escaping will be done only when necessary, for example, when the identifier contains spaces.

- Binary constants ... a hash mark, or two hash marks.
  Already now we have quite a few error messages that do the following:
    {
      char buf[256];
      escape_for_error(buf, sizeof(buf), binary_data, binary_data_len);
      my_error(ER_SOME_ERROR, MYF(0), buf);
    }
  A new '#' modifier will change the '%s' behavior as follows:
    %#s - print string as is, escaping only bytes outside the range 0x20-0x7E
  Escaping happens the same way as with escaping of "Binary string constants": REVERSE SOLIDUS + 'x' + hex(high part of byte) + hex(low part of byte).
  EXAMPLE: my_vsnprintf("%#s", "A.B") -> "A\x2EB"
  [ 2009-07-03 A suggestion about "##s" has been removed with consent of Supervisor and Architecture Reviewer. See progress notes. ]

Astute readers will have noticed that we describe "escaping" for section "Construction procedure", for section "Sending procedure", and for this section "New formatting" -- with different rules each time.
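The byte-escaping rule stated for '%#s', which is the same rule given for binary string constants, can be sketched in C as follows. This is an illustration only: escape_binary() is a hypothetical name, not the server's function, and stopping before a partial escape when the buffer is full is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
  Hypothetical helper: copy src to dst, leaving bytes in the range
  0x20-0x7E as is and writing every other byte as '\' 'x' followed by
  two upper-case hex digits. Always NUL-terminates dst and stops early
  rather than emitting a partial escape. Returns the number of bytes
  written, not counting the terminator.
*/
static size_t escape_binary(char *dst, size_t dst_size,
                            const unsigned char *src, size_t src_len)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t out = 0;
  size_t i;

  assert(dst != NULL && dst_size > 0);

  for (i = 0; i < src_len; i++)
  {
    unsigned char c = src[i];
    if (c >= 0x20 && c <= 0x7E)
    {
      if (out + 1 >= dst_size)       /* keep room for the terminator */
        break;
      dst[out++] = (char) c;         /* printable byte, includes '\' */
    }
    else
    {
      if (out + 4 >= dst_size)       /* need 4 bytes plus terminator */
        break;
      dst[out++] = '\\';
      dst[out++] = 'x';
      dst[out++] = hex[c >> 4];      /* high part of byte */
      dst[out++] = hex[c & 0x0F];    /* low part of byte */
    }
  }
  dst[out] = '\0';
  return out;
}
```

For the three bytes 0x41 0xC3 0x9F this writes A\xC3\x9F, matching the duplicate-entry example in "Construction procedure". A lone '\' passes through unchanged because it lies inside the range 0x20-0x7E.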
We will make the optimistic assumption that these escaping rules either can't conflict or can be applied in sequence.

Upgrading
=========

Suppose I'm a user and I made my own changes to 6.0 errmsg.txt. What upgrading tool shall I use to change my errmsg.txt to errmsg-utf8.txt? Answer: there is no tool. Sergei Golubchik proposed to add some automation in the Makefile, but we don't know about that yet. The tentative decision is: People who changed errmsg.txt are "on their own" and we are not responsible for changing errmsg.txt appropriately.

Miscellaneous
=============

We will make sure that "BUG#1406 Tablename in Errormessage not in default characterset" is fixed by this WL by adding an appropriate test.

We will leave it up to the implementor to decide whether to convert SQLSTATE as well as MESSAGE_TEXT.

We will leave it up to the implementor to check whether FEDERATED supports "SET NAMES" and "SET lc_time_names". If so, then lc_messages should be supported as well.

References:
===========

Feature request at BUG#7722 "language should be a dynamic global server variable"
Feature request at BUG#39926 "Language should belong to client instead of server"
Feature request at BUG#28139 "client in user language"
BUG#1406 Tablename in Errormessage not in default characterset
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.