WL#1349: Use operating system localization to send it as a default client character set
Affects: Server-5.5
—
Status: Complete
This task is a part of Sun Globalization Requirements, Locale and Character Set encoding identification. MySQL client tools will use the operating system localization information. Rationale ========= If the operating system is localized to some character set, in most cases the console application is configured to use this character set as well. Currently "mysql" and the other client tools use the compiled-in character set (latin1 in most binary distributions). The user gets a total mess with extended characters. Using the operating system's localization will help to start using MySQL more easily. Support also thinks this will save a lot of time for them. UNIX ==== Localization information is stored in the "LANG" environment variable. For example, on a non-Unicode Russian machine I have: $echo $LANG ru_RU.KOI8-R POSIX.1-2001 systems -------------------- Modern UNIX systems (Linux, FreeBSD) have a special nl_langinfo() function to help reading localization information. MySQL client tools will use nl_langinfo(CODESET) to get character set. Other systems ------------- There was a proposal for the systems not having nl_langinfo(): We could use getenv("LANG") to know the localization and extract the character set part manually. Note, sometimes locale name can have an optional "variant" part starting from the "@" character: fi_FI.ISO8859-15@euro The "variant" part should be removed before extracting character set. However, we would like to postpone this bit of non-trivial and error-prone work until it'll be found to be absolutely necessary. Systems not having nl_langinfo() will do nothing related to character set guessing. Windows ======= Windows provides a number of API functions to ask about code pages. GetLocaleInfo() --------------- GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, buf, sizeof(buf)); returns Operating System character set, in a string representing code page ID, for example "866". LOCALE_IDEFAULTCODEPAGE means "return DOS code page", which is suitable for console applications. The caller can request other types of code pages: LOCALE_IDEFAULTANSICODEPAGE returns ANSI code page (for a GUI application) LOCALE_IDEFAULTMACCODEPAGE return MAC code page (not sure what this is) but we won't use other types of code pages. UINT GetACP(VOID) ----------------- Retrieves the current Windows ANSI code page identifier for the operating system. http://msdn.microsoft.com/en-us/library/dd318070(VS.85).aspx Used for GUI applications. UINT GetOEMCP(VOID) ------------------- Retrieves the current OEM code-page identifier for the system. (OEM stands for original equipment manufacturer.) If the function succeeds, the return value is the current OEM code-page identifier for the system or a default identifier if no code page is current. http://msdn.microsoft.com/en-us/library/dd318114(VS.85).aspx Note: Is not sensible to "chcp". If you change the current console code page using "chcp" command, GetOEMCP() still returns the default code page. GetConsoleCP() -------------- Retrieves the input code page used by the console associated with the calling process. http://msdn.microsoft.com/en-us/library/ms683162(VS.85).aspx Note: sensible to "chcp" changes. GetConsoleOutputCP() -------------------- Retrieves the output code page used by the console associated with the calling process. A console uses its output code page to translate the character values written by the various output functions into the images displayed in the console window. http://msdn.microsoft.com/en-us/library/ms683169(VS.85).aspx Note: sensible to "chcp" changes. We will use GetConsoleCP() to extract the console code page of the current process. Standard client programs ======================== Detecting character set from the information provided by the operating system will be done only if the client character set is not given in my.cnf and not given in the command line option --default-character-set. This section is applicable for all the following standard client programs: mysql mysqladmin mysqlcheck mysqlimport mysqlshow Standard client programs that do not need character set detection ================================================================= mysqldump - it uses UTF8 by default, to be able to dump data of any character set. This won't change. User programs ============= User programs using MySQL API won't use OS information. This is because (in the case of Windows) libmysqlclient does not know if it was compiled with a console, or with a GUI application, and therefore does not know which code page to use (DOS or ANSI). Client-server protocol ====================== Nothing needs to be changed in client-server protocol. On handshake, a client program sends its character set to the server. It is one byte, MySQL internal numeric character set ID. This was implemented in 4.1. This procedure will remain the same. Adding two-byte IDs (which were implemented in 6.0) into client-server protocol is out of scope of this task. Translation from OS names to MySQL IDs ====================================== MySQL client programs will parse the LANG variable and translate from the non-Windows-operating-system name e.g. "ru_RU.KOI8-R" to a MySQL character set ID, or from the Windows locale ID e.g. "866" to a MySQL character set ID. Translation from the OS names and IDs will be done on the client side. This is required because the client side must also set its own internal variables to handle backslash correctly (which can be a part of a multi-byte character in some Asian character sets). There are several possible conversion results: 1. OS character set has an exact counterpart in MySQL. For example, Windows Code Page "850" is exactly the same as MySQL's cp850. 2. OS character set doesn't have a corresponding character set in MySQL. Client will issue a warning and proceed with the default compiled-in character set (typically latin1). 3. MySQL has a character set which is very similar to the OS character set, but is not exactly the same. For example, MySQL does not have cp949, but has euckr, which is a subset of cp949. Using euckr if OS character set = cp949 is much more useful than switching to the default character set (latin1). So in such cases we will use the similar character set, but will also issue a warning. A decision if there is an approximate MySQL character set which is "good enough" will be done for each OS character set individually. 4. OS character set has an exact counterpart in MySQL, but this character set is not supported by MySQL client. For example: UCS2, UTF16, UTF32. We will switch to the default character set (typically latin1) and issue a warning. Mapping between Windows code pages and MySQL character sets =========================================================== In the table below: ---- means "no mapping", as in case #2 in the previous section. [xxx] means "approximate" correspondence, case #3 {xxx} means "unsupported client character set", case #4 (xxx) means that we're planning to add this character set to MySQL in the near future, but currently it does not exist. We'll treat these cases similar to #2 at the moment, and will change mapping as soon as MySQL has the character set. xxx means "exact counterpart", as in case #1. ? means that this code page needs some more investigation before we choose between #3 and #4. (All "?"s should be removed before final architecture review is done.) MySQL Win CP IANA CSet Name Additional information ----- ------ -------------- ---------------------- [cp850] 437 IBM437 OEM United States ----- 708 ISO-8859-6 Arabic (ASMO 708, arabic, iso-8859-6) ----- 709 ASMO_449 Arabic (ASMO-449+, BCON V4) ----- 710 - Arabic - Transparent Arabic (DOS) ----- 720 DOS-720 Arabic (Transparent ASMO); Arabic (DOS) ----- 737 ibm737 OEM Greek (formerly 437G); Greek (DOS) ----- 775 ibm775 OEM Baltic; Baltic (DOS) cp850 850 ibm850 OEM Multilingual Latin 1 (DOS) cp852 852 ibm852 OEM Latin 2; Central European (DOS) ----- 855 IBM855 OEM Cyrillic (primarily Russian) ----- 857 ibm857 OEM Turkish; Turkish (DOS) [cp850] 858 IBM00858 OEM Multilingual Latin 1 + Euro symbol ----- 860 IBM860 OEM Portuguese; Portuguese (DOS) ----- 861 ibm861 OEM Icelandic; Icelandic (DOS) ----- 862 DOS-862 OEM Hebrew; Hebrew (DOS) ----- 863 IBM863 OEM French Canadian; French Canadian (DOS) ----- 864 IBM864 OEM Arabic; Arabic (864) ----- 865 IBM865 OEM Nordic; Nordic (DOS) cp866 866 cp866 OEM Russian; Cyrillic (DOS) ----- 869 ibm869 OEM Modern Greek; Greek, Modern (DOS) [tis620] 874 windows-874 ANSI/OEM Thai (same as 28605, ISO 8859-15) cp932 932 shift_jis ANSI/OEM Japanese; Japanese (Shift-JIS) [gbk] 936 gbk ANSI/OEM Simplified Chinese(PRC,Singapore) [euckr] 949 ks_c_5601-1987 ANSI/OEM Korean (Unified Hangul Code) big5 950 big5 ANSI/OEM Traditional Chinese {utf16le} 1200 utf-16 Unicode UTF-16, little-endian byte order {utf16} 1201 unicodeFFFE Unicode UTF-16, big-endian byte order. latin2 1250 windows-1250 ANSI Central European cp1251 1251 windows-1251 ANSI Cyrillic; Cyrillic (Windows) latin1 1252 windows-1252 ANSI Latin 1; Western European (Windows) greek 1253 windows-1253 ANSI Greek; Greek (Windows) latin5 1254 windows-1254 ANSI Turkish; Turkish (Windows) [hebrew] 1255 windows-1255 ANSI Hebrew; Hebrew (Windows) cp1256 1256 windows-1256 ANSI Arabic; Arabic (Windows) cp1257 1257 windows-1257 ANSI Baltic; Baltic (Windows) ----- 1258 windows-1258 ANSI/OEM Vietnamese; Vietnamese (Windows) ----- 1361 Johab Korean (Johab) macroman 10000 macintosh MAC Roman; Western European (Mac) [sjis] 10001 x-mac-japanese Japanese (Mac) [big5] 10002 x-mac-chinesetrad MAC Traditional Chinese (Big5) ----- 10003 x-mac-korean Korean (Mac) ----- 10004 x-mac-arabic Arabic (Mac) ----- 10005 x-mac-hebrew Hebrew (Mac) ----- 10006 x-mac-greek Greek (Mac) ----- 10007 x-mac-cyrillic Cyrillic (Mac) [gb2312] 10008 x-mac-chinesesimp MAC Simplified Chinese (GB 2312) ----- 10010 x-mac-romanian Romanian (Mac) ----- 10017 x-mac-ukrainian Ukrainian (Mac) [tis620] 10021 x-mac-thai Thai (Mac) macce 10029 x-mac-ce MAC Latin 2; Central European (Mac) ----- 10079 x-mac-icelandic Icelandic (Mac) ----- 10081 x-mac-turkish Turkish (Mac) ----- 10082 x-mac-croatian Croatian (Mac) ----- 12000 utf-32 Unicode UTF-32, little-endian byte order {utf32} 12001 utf-32BE Unicode UTF-32, big-endian byte order ----- 20000 x-Chinese_CNS CNS Taiwan; Chinese Traditional (CNS) ----- 20001 x-cp20001 TCA Taiwan ----- 20002 x_Chinese-Eten Eten Taiwan; Chinese Traditional (Eten) ----- 20003 x-cp20003 IBM5550 Taiwan ----- 20004 x-cp20004 TeleText Taiwan ----- 20005 x-cp20005 Wang Taiwan ----- 20105 x-IA5 IA5 (IRV International Alphabet No.5 7bit ----- 20106 x-IA5-German IA5 German (7-bit) swe7 20107 x-IA5-Swedish IA5 Swedish (7-bit) ----- 20108 x-IA5-Norwegian IA5 Norwegian (7-bit) ascii 20127 us-ascii US-ASCII (7-bit) ----- 20261 x-cp20261 T.61 ----- 20269 x-cp20269 ISO 6937 Non-Spacing Accent koi8r 20866 koi8-r Russian (KOI8-R); Cyrillic (KOI8-R) ujis 20932 EUC-JP Japanese (JIS 0208-1990 and 0121-1990) [gb2312] 20936 x-cp20936 Simplified Chinese (GB2312) [euckr] 20949 x-cp20949 Korean Wansung koi8u 21866 koi8-u Ukrainian (KOI8-U) [latin1] 28591 iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO) latin2 28592 iso-8859-2 ISO 8859-2 Central European ----- 28593 iso-8859-3 ISO 8859-3 Latin 3 ----- 28594 iso-8859-4 ISO 8859-4 Baltic ----- 28595 iso-8859-5 ISO 8859-5 Cyrillic ----- 28596 iso-8859-6 ISO 8859-6 Arabic greek 28597 iso-8859-7 ISO 8859-7 Greek hebrew 28598 iso-8859-8 ISO 8859-8 Hebrew; Hebrew (ISO-Visual) latin5 28599 iso-8859-9 ISO 8859-9 Turkish latin7 28603 iso-8859-13 ISO 8859-13 Estonian (latin9) 28605 iso-8859-15 ISO 8859-15 Latin 9 ----- 29001 x-Europa Europa 3 hebrew 38598 iso-8859-8-i ISO 8859-8 Hebrew; Hebrew (ISO-Logical) ----- 50220 iso-2022-jp ISO 2022 Japanese, no halfwidth Katakana ----- 50221 csISO2022JP ISO 2022 Japanese, with halfwidth Katakana ----- 50222 iso-2022-jp ISO 2022 Japanese JIS X 0201-1989 ----- 50225 iso-2022-kr ISO 2022 Korean ----- 50227 x-cp50227 ISO 2022 Simplified Chinese ----- 50229 - ISO 2022 Traditional Chinese ujis 51932 euc-jp EUC Japanese gb2312 51936 EUC-CN EUC Simplified Chinese (EUC) euckr 51949 euc-kr EUC Korean big5 51950 - EUC Traditional Chinese ----- 52936 hz-gb-2312 HZ-GB2312 Simplified Chinese (HZ) (7-bit) (gb18030) 54936 GB18030 XP and later: Simplified Chinese (4 byte) ---- 57002 x-iscii-de ISCII Devanagari ---- 57003 x-iscii-be ISCII Bengali ---- 57004 x-iscii-ta ISCII Tamil ---- 57005 x-iscii-te ISCII Telugu ---- 57006 x-iscii-as ISCII Assamese ---- 57007 x-iscii-or ISCII Oriya ---- 57008 x-iscii-ka ISCII Kannada ---- 57009 x-iscii-ma ISCII Malayalam ---- 57010 x-iscii-gu ISCII Gujarati ---- 57011 x-iscii-pa ISCII Punjabi ---- 65000 utf-7 Unicode (UTF-7) utf8 65001 utf-8 Unicode (UTF-8) Notes: - MySQL [cp850] differs from Windows cp437 for 43 characters in the range 0x80..0xFF. - MySQL [cp850] is compatible with Windows cp858 except that cp858 replaces Turkish letter "DOTLESS SMALL I" at position 0xD5 with "EURO SIGN". - MySQL [latin5] is compatible with cp1254, except for the 0x80-0x9F range. 0x80-0x9F is not assigned in latin5 0x80-0x9F has some punctuation characters in cp1254 ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00850.txt ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00858.txt - MySQL [hebrew] (also known as iso-8859-8) is compatible with Windows cp1255 in the alphabetic range 0xE0-0xFA. Windows cp1255 has additionally: * general purpose punctuation characters in the range 0x80-0x9F * Hebrew punctuation characters in the range 0xC0-0xDF. - MySQL [tis620] is compatible with Windows cp874 in the alphanumeric range 0xA1-0xFF. * tis-620 maps the range 0x80-0x9F to control characters U0080-U009F * cp874 uses 0x80-0x9F for some punctuation characters (9), and for EURO SIGN. - MySQL [euckr] is a subset of cp949 CP949 additionally encodes: * 0x80 0x0080 #* 0xA2E6 0x20AC #EURO SIGN * 0xA2E7 0x00AE #REGISTERED SIGN * 0xFF 0xF8F7 # * 0xC9A1..0xC9FE mapped to U+E000..E05D (94 characters) * 0xFEA0..0xFEFE mapped to U+E05E..E0BB (94 characters) - MySQL [euckr] is close to "Code Page 20949 Korean Wansung" * CP20949 maps 0x80-0x9F to U+0x0080-U+0x009F (unused in euckr) * euckr additionally uses [81-A0][41-FE] and [A1-FE][41-A0] (unused in 20949) - MySQL [latin1] is a superset of Windows Code Page 28591 (which is iso-8859-1) - MySQL [tis620] is close to x-mac-thai * tis-620 maps 0x80-x9F to control characters U0080-U009F * x-mac-thai uses 0x80-x9F for some additional Apple-specific Thai characters from Unicode private use area. - MySQL [sjis] is close to x-mac-japanese (Code Page 10001) * Difference is in 7 punctuation characters * Mac-Japanese has two additional blocks 0x8540-0x886D and 0xEB41-0xED96 with rare digit forms, circled forms, parenthesized forms, vertical forms, presentation forms. - MySQL [big5] is close to x-mac-chinesetrad (Code Page 10002) Difference: * Apple introduces 7 additional single byte characters * Apple maps 0x83..0x9F - undefined, map to U0083-U009F * Different mapping of 26 characters in the range 0xA??? * Big5 encodes 249 additional characters in the range 0xC??? (KATAKANA, HIRAGANA, CYRILLIC, CIRCLED and PARENTHESIZED DIGITs) - MySQL [gb2312] is close to x-mac-chinesesimp (Code Page 10008) Difference: * Apple introduces 7 additional single byte characters * Apple maps 0x83..0x9F - undefined, map to U0083-U009F * Difference in 8 characters in 0xA??? * Apple has 29 additional characters (vertical forms) in 0xA6?? * Apple has 6 additional characters (pinyin) in 0xA8?? - MySQL [gb2312] is close to x-cp20936 (Code Page 20936) Difference: * gb2312 uses 0xA1AC for "0x2016 #DOUBLE VERTICAL LINE" * x-cp20936 uses 0x80 for "0x0080 # " * x-co20936 maps 41 single byte characters 0x81..0x9F, 0xA0, 0xF8-FF to range 0xF8D8..0xF8FF. - MySQL [gbk] is close to Code Page 936. Difference: cp936 encodes 0x80 as 0x20AC #EURO SIGN cp936 encodes 0xFF as 0xF8F5 # cp936 has additional 2149 characters in the range [A1..FE]?? Mapping between Unix locale charmaps and MySQL character sets ============================================================= MySQL Unix ----- ---- cp1251 ansi1251 armscii8 armscii8 ascii ASCII big5 Big5 ----- big5hkscs cp1251 cp1251 cp1255 cp1255 cp866 CP866 gb2312 eucCN ujis eucJP euckr eucKR ----- euctw (gb18030) gb18030 gb2312 gb2312 gbk gbk geostd8 georgianps cp1252 IBM-1252 ----- ISCII-DEV [latin1] iso88591 [latin1] ISO_8859-1 [latin1] ISO8859-1 [latin1] ISO-8859-1 ----- iso885910 latin7 iso885913 latin7 ISO_8859-13 latin7 ISO8859-13 latin7 ISO-8859-13 ----- iso885914 (latin9) iso885915 (latin9) ISO_8859-15 (latin9) ISO8859-15 (latin9) ISO-8859-15 latin2 iso88592 latin2 ISO_8859-2 latin2 ISO8859-2 latin2 ISO-8859-2 ----- iso88593 ----- ISO_8859-4 ----- ISO8859-4 ----- iso88595 ----- ISO_8859-5 ----- ISO8859-5 ----- iso88596 greek iso88597 greek ISO_8859-7 greek ISO8859-7 greek ISO-8859-7 hebrew iso88598 hebrew ISO_8859-8 hebrew ISO8859-8 hebrew ISO-8859-8 latin5 iso88599 latin5 ISO_8859-9 latin5 ISO8859-9 latin5 ISO-8859-9 koi8r koi8r koi8r KOI8-R ----- koi8t koi8u koi8u koi8u KOI8-U ----- pt154 sjis Shift_JIS sjis SJIS ----- tcvn tis620 tis620 ujis ujis ascii US-ASCII utf8 utf8 Note: mapping should be done case insensitively. For example: sjis == SJIS eucJP == EUCJP == eucjp References ========== - GetLocaleInfo: http://msdn.microsoft.com/en-us/library/ms960719.aspx - LCTYPE Constants http://msdn.microsoft.com/en-us/library/ms957134.aspx - Windows code pages http://www.microsoft.com/globaldev/reference/WinCP.mspx http://www.microsoft.com/globaldev/reference/oem.mspx - Mac encodings http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ http://www.opensource.apple.com/darwinsource/WWDC2004/tcl-14/tcl/tools/encoding/ - Swedish 7-bit Alphabet http://en.wikipedia.org/wiki/ISO/IEC_646 - Coded character sets and Encodings (Debian) http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html - Discussion about Win API code page functions: http://lists.gnupg.org/pipermail/gnupg-users/2005-July/026189.html
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.