Globalizing MySQL - Interview with Alexander "Bar" Barkov
We interviewed Alexander "Bar" Barkov to discuss the recent improvements in MySQL globalization that enable MySQL users around the world to use the database system more easily and naturally.
Q: Hi Bar, can you tell us a little bit about yourself? How long have you been working with MySQL and what is your role?
Bar: I started at MySQL AB in November 2001, together with other two colleagues from my home town of Izhevsk, Russia. Our first project was to provide support for GIS data types in MySQL 4.1, and I switched mostly to globalization tasks after that. The first few steps in globalization include: multiple character set support (per column, table, database), client-server character set conversion, Unicode character sets utf8 and ucs2, as well as Unicode collation algorithm support. In addition to globalization I also participated in various aspects of MySQL development, including fixing bugs, doing code reviews, occasionally developing codes for non-globalization tasks, and doing some architecture work from time to time, such as writing WorkLog tasks or doing architecture review for the task descriptions.
Q: You mentioned the term "globalization." What does this mean and why is it important for MySQL?
Bar: By globalization we mean two design aspects: "localization" as well as "internationalization".
Localization means tuning MySQL for a specific country or a language (a region). This, for example, includes support for the region-specific sorting orders (collations), national character sets, numeric formats, translated messages, as well as other aspects that make it more comfortable for international users to use MySQL. Localization is important for popularizing MySQL all around the world.
Internationalization, on the other hand, is the underlying infrastructure. It includes support for Unicode standards, collation customization code, and parameterizable error message routines. It's what makes localization easier.
Q: What are the key features or improvements in MySQL 5.5 in terms of internationalization?
Bar: An important task made in MySQL 5.5 was the support for supplementary characters. The contemporary Unicode versions assign a lot of characters outside of the Basic Multilingual Plane. This is particularly important for Chinese, Japanese and Korean users. Another important change we made in MySQL 5.5 is reading the Operating System localization information. Earlier MySQL versions always started client tools with latin1 character set by default. Now the client tools check the operating system environment and choose the character set best suited for the current machine. That's an important step towards one of the MySQL golden rules - "ease of use".
Another interesting 5.5 feature I would like to mention is internationalized number format. Prior to version 5.5, MySQL had only a hard-coded US numeric format, with dot as the decimal separator, comma as the thousand separator, and with three grouping digits. This is not always convenient for international users. For example, in Germany it's more typical to use comma as the decimal separator and dot as the thousand separator, in Russia thousand separator is space, while in India the digit grouping is not always by 3!
MySQL 5.5 extends the FORMAT() function to accept an optional third argument representing the desired region (locale). This example demonstrates the new feature:
mysql> SELECT 'en_US' AS locale, FORMAT(123457678.9, 3, 'en_US') AS format -> UNION SELECT 'de_DE', FORMAT(123457678.9, 3, 'de_DE') -> UNION SELECT 'en_IN', FORMAT(123457678.9, 3, 'en_IN') -> UNION SELECT 'ru_RU', FORMAT(123457678.9, 3, 'ru_RU'); +----------+---------------------+ | locale | format | +----------+---------------------+ | en_US | 123,457,678.900 | | de_DE | 123.457.678,900 | | en_IN | 12,34,57,678.900 | | ru_RU | 123 457 678,900 | +----------+---------------------+
Also, it's worth mentioning that in MySQL 5.5 we fixed two long-standing problems related to globalization.
The first problem was BINARY result data type of a string function when a numeric or a datetime value is passed as an argument. This problem appeared first in MySQL 4.1 when we introduced the BINARY data type. For queries such as "SELECT CONCAT('Date: ', date_column) FROM t1" or "SELECT CONCAT('Count: ', int_column) FROM t1", the CONCAT() operation returned VARBINARY/BINARY result set metadata, which was especially confusing for MySQL connectors, for example JDBC and ODBC drivers, PHP driver, etc.
Now the queries work much better and as expected. The result set metadata for CONCAT() correctly reports VARCHAR/CHAR metadata on the client side. The result of such CONCAT() operations can further be passed to other string functions, like LOWER() or UPPER(), and return anticipated results.
The second problem was incomplete implementation of internationalized error messages. Error messages were not often properly converted to the client character. In MySQL 5.5 we not only fixed all the conversion problems, but also simplified error message translation.
This task was perfectly coded by my colleague Sergey Glukhov, who is working from the same office with me here in Izhevsk. The error message template file is now entirely UTF-8. In earlier versions the template file consisted of a mixture of different character sets, which was very hard to edit.
Also, Sergey added support for positioned arguments. Printf-style positioned arguments are well known to C programmers. Now MySQL supports the same syntax in error message templates, which gives even more flexibility to translators.
Let's use this error message template as an example:
ER_WRONG_VALUE eng "Incorrect %-.32s value: '%-.128s'"
which can deliver this final error message:
"Incorrect DATE value: '2001-20-20'"
Note, the data type name is in the first place, and the value is in the second place. This sounds clear enough for English speakers, but in some languages it’s more natural to use a different order of words in this sentence, such as
"The value '2001-20-20' is not correct for DATE"
MySQL 5.5 made this possible. One can add a translated message into the message template file using positioned parameters, like this:
ER_WRONG_VALUE eng "Incorrect %-.32s value: '%-.128s'" mylang "The value '%2$-.128s' is not correct for %1$-.32s"
So, now the error message in English still uses the "old" order, while the translated message uses the opposite order.
I hope this extension will also give more fun and satisfaction to our potential translating contributors, and welcome our users to participate in translation. :)
Q: What have you done in MySQL 5.6 to improve internationalization support?
Bar: In MySQL 5.6 Development Milestone Release, we significantly improved collation customization routines. This task belongs to the internationalization (infrastructure) category. Namely, we improved support for so-called contractions, when two or more letters are sorted as a single letter, and expansions, when a single letter is sorted as a combination of two or more letters.
Earlier MySQL versions supported only contractions of two Basic Latin Letters, such as CH in Czech, and did not support expansion customization. Starting from MySQL 5.6 contractions are not limited to Basic Latin Letters any more, and expansion customization is also supported. Using these improvements we have already added Unicode collations for Croatian with support for the letter DŽ (a contraction), and German2 with support for rules Ä=AE, Ö=OE, Ü=UE (expansions).
MySQL 5.6 also includes some more collation customization improvements:
- long expansions and contractions (up to 6 characters)
- "reset before"
- long tailoring
- previous context, to support prolongation marks
These changes all open the doors for improvements in future MySQL releases, such as new language collations (Hungarian, Bengali, Myanmar, Maltese), and better support for Chinese, Japanese, and Korean.
Q: Are there any platform-specific improvements?
Bar: Yes. In MySQL 5.6 we made a good step towards better Windows integration.
For example, Windows native Unicode character set is UTF-16LE. MySQL 5.6 not only makes it possible to store data in UTF-16LE in your tables, but also retrieve data from MySQL server using UTF-16LE (even if the source table itself stores data in UTF-8 or in an 8-bit character set) and use the data further in a Unicode application without having to program conversion code on the application side.
Q: Has any of your work been in response to requests from the MySQL community? Have you incorporated community contributions into MySQL's globalization support?
Bar: We added two community-contributed patches recently to implement Sinhala collation and Romansh language support for month and week day names in datetime functions. I want to especially mention the Romansh patch, which was really good: it followed MySQL coding style very well, provided well-covering tests for MTR (the MySQL regression test system), and had references to the Romansh language authority standards and documents. So it only took us a few days from receiving the patch to pushing it into the official source tree.
Also, we added Greek language support for datetime functions, as well as Vietnamese collation. Patches for these two tasks were not fully made by the community contributors, but active participation of Greek and Vietnamese users, who were writing feature requests and providing language specific data, allowed us to complete these tasks faster.
Q: What other improvements have you made?
Bar: Some users may find a new WEIGHT_STRING() function useful. It returns a binary sortable image corresponding to the character string parameter, taking into account its collation. It's very close to what the strxfrm() function does in C.
One of the use cases of this function is to do client-side sorting. Let's use the following example. Suppose we create and populate a table with some international data:
CREATE TABLE t1 (s1 VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_unicode_ci); INSERT INTO t1 VALUES ('...'),('...'),('...'),...('...');
Then we query the table:
SELECT s1, WEIGHT_STRING(s1 COLLATE utf8_polish_ci), WEIGHT_STRING(s1 COLLATE utf8_german2_ci), WEIGHT_STRING(s1 COLLATE utf8_swedish_ci), WEIGHT_STRING(s1 COLLATE utf8_czech_ci) FROM t1 ORDER by s1;
When the results are fetched on the client side, they are ordered using utf8_unicode_ci, according to the collation of s1, which is given in the ORDER BY list.
But among the data itself, the client also has information about how to sort these strings according to Polish, German2, Swedish or Czech rules. The application now can use the data from the 2nd, 3rd, 4th, or 5th column to sort the 1st column according to desired rules. For example, the application can call the qsort() function, with memcmp() as the comparison routine, to order the strings from the result set as the application user chooses. This can be especially important in the case of a heavy SQL query, when re-executing the query with an alternative ORDER BY is not desirable.
Q: And finally, what are you working on now? What should MySQL users expect to see from you in the future?
Bar: We'll keep adding new language collations on top of the collation customization improvements we made recently, which, as I said earlier, made possible to implement collation for (including but not limited to) Bengali, Maltese, Myanmar, Chinese, Japanese, Korean.
Also, we're working on another improvement in collation routines, which support for so-called secondary and tertiary sorting levels, and it will provide even nicer ORDER BY results. For instance, even a case insensitive collation will be able to sort small letters before capital letters, so for example 'a' will always go before 'A' in ORDER BY, but at the same time 'a' and 'A' will still be equal to each other in WHERE condition. Now the order of 'a' and 'A' is not predictable for a case insensitive collation, but the secondary and tertiary sorting levels will solve this issue. In addition, we’ll enable stricter sorting order for accents in accent insensitive collations. This is important, for example, for French and Vietnamese.