WL#1349: Use operating system localization to send it as a default client character set

Affects: Server-5.5   —   Status: Complete

This task is a part of Sun Globalization Requirements,
Locale and Character Set encoding identification.

MySQL client tools will use the operating system
localization information.

Rationale
=========
If the operating system is localized to some character set,
in most cases the console application is configured to use this
character set as well.

Currently "mysql" and the other client tools use the
compiled-in character set (latin1 in most binary 
distributions). The user gets a total mess with extended characters.

Using the operating system's localization will help
to start using MySQL more easily.

Support also thinks this will save a lot of time for them.


UNIX
====
Localization information is stored in the "LANG" environment variable.
For example, on a non-Unicode Russian machine I have:

$echo $LANG
ru_RU.KOI8-R

POSIX.1-2001 systems
--------------------
Modern UNIX systems (Linux, FreeBSD) have a special
nl_langinfo() function to help reading localization
information.

MySQL client tools will use nl_langinfo(CODESET)
to get character set.

Other systems
-------------
There was a proposal for the systems not having nl_langinfo():

We could use getenv("LANG") to know the localization
and extract the character set part manually.
Note, sometimes locale name can have an optional "variant" part
starting from the "@" character:
  fi_FI.ISO8859-15@euro
The "variant" part should be removed before extracting character set.

However, we would like to postpone this bit of non-trivial and error-prone
work until it'll be found to be absolutely necessary.
Systems not having nl_langinfo() will do nothing related to
character set guessing.

Windows
=======
Windows provides a number of API functions to ask about code pages.

GetLocaleInfo()
---------------
GetLocaleInfo(LOCALE_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE, buf, sizeof(buf));

returns Operating System character set, in a string
representing code page ID, for example "866".

LOCALE_IDEFAULTCODEPAGE means "return DOS code page",
which is suitable for console applications.

The caller can request other types of code pages:
LOCALE_IDEFAULTANSICODEPAGE returns ANSI code page (for a GUI application)
LOCALE_IDEFAULTMACCODEPAGE  return MAC code page (not sure what this is)
but we won't use other types of code pages.


UINT GetACP(VOID) 
-----------------
Retrieves the current Windows ANSI code page
identifier for the operating system.
http://msdn.microsoft.com/en-us/library/dd318070(VS.85).aspx

Used for GUI applications.


UINT GetOEMCP(VOID)
-------------------
Retrieves the current OEM code-page identifier for the system.
(OEM stands for original equipment manufacturer.)
If the function succeeds, the return value is the current OEM code-page
identifier for the system or a default identifier if no code page is current. 

http://msdn.microsoft.com/en-us/library/dd318114(VS.85).aspx

Note: Is not sensible to "chcp".
If you change the current console code page using "chcp" command,
GetOEMCP() still returns the default code page.


GetConsoleCP()
--------------
Retrieves the input code page used by the console associated
with the calling process.

http://msdn.microsoft.com/en-us/library/ms683162(VS.85).aspx

Note: sensible to "chcp" changes.


GetConsoleOutputCP()
--------------------
Retrieves the output code page used by the console associated with the calling 
process. A console uses its output code page to translate the character values 
written by the various output functions into the images displayed in the 
console window.

http://msdn.microsoft.com/en-us/library/ms683169(VS.85).aspx

Note: sensible to "chcp" changes.



We will use GetConsoleCP() to extract the console code page
of the current process.


Standard client programs
========================
Detecting character set from the information
provided by the operating system will be done only
if the client character set is not given
in my.cnf and not given in the command line
option --default-character-set.

This section is applicable for all the following standard client programs:

mysql
mysqladmin
mysqlcheck
mysqlimport
mysqlshow

Standard client programs that do not need character set detection
=================================================================
mysqldump - it uses UTF8 by default, to be able to dump data of
any character set. This won't change.


User programs
=============
User programs using MySQL API won't use OS information.
This is because (in the case of Windows) libmysqlclient does not
know if it was compiled with a console, or with a GUI application,
and therefore does not know which code page to use (DOS or ANSI).


Client-server protocol
======================
Nothing needs to be changed in client-server protocol.
On handshake, a client program sends its character set to the server.
It is one byte, MySQL internal numeric character set ID.
This was implemented in 4.1. This procedure will remain the same.

Adding two-byte IDs (which were implemented in 6.0) into client-server
protocol is out of scope of this task.


Translation from OS names to MySQL IDs
======================================
MySQL client programs will parse the LANG variable and translate
from the non-Windows-operating-system name e.g. "ru_RU.KOI8-R" to
a MySQL character set ID, or from the Windows locale ID e.g. "866"
to a MySQL character set ID.

Translation from the OS names and IDs will be done on the client side.
This is required because the client side must also set its own
internal variables to handle backslash correctly (which can be
a part of a multi-byte character in some Asian character sets).

There are several possible conversion results:

1. OS character set has an exact counterpart in MySQL.
For example, Windows Code Page "850" is exactly the same
as MySQL's cp850.

2. OS character set doesn't have a corresponding character
set in MySQL. Client will issue a warning and proceed
with the default compiled-in character set (typically latin1).

3. MySQL has a character set which is very similar to the
OS character set, but is not exactly the same.
For example,
MySQL does not have cp949, but has euckr,
which is a subset of cp949. Using euckr if OS character set = cp949
is much more useful than switching to the default character
set (latin1). So in such cases we will use the similar 
character set, but will also issue a warning.
A decision if there is an approximate MySQL character set
which is "good enough" will be done for each OS character
set individually.

4. OS character set has an exact counterpart in MySQL,
but this character set is not supported by MySQL client. For example:
UCS2, UTF16, UTF32. We will switch to the default
character set (typically latin1) and issue a warning.


Mapping between Windows code pages and MySQL character sets
===========================================================

In the table below:

----  means "no mapping", as in case #2 in the previous section.
[xxx] means "approximate" correspondence, case #3
{xxx} means "unsupported client character set", case #4
(xxx) means that we're planning to add this character set to MySQL
      in the near future, but currently it does not exist.
      We'll treat these cases similar to #2 at the moment,
      and will change mapping as soon as MySQL has the character set.
xxx   means "exact counterpart", as in case #1.
?     means that this code page needs some more investigation
      before we choose between #3 and #4.
      (All "?"s should be removed before final architecture review is done.)



MySQL  Win CP     IANA CSet Name     Additional information
-----  ------      --------------     ----------------------
[cp850]   437      IBM437             OEM United States
-----     708      ISO-8859-6         Arabic (ASMO 708, arabic, iso-8859-6)
-----     709      ASMO_449           Arabic (ASMO-449+, BCON V4)
-----     710      -                  Arabic - Transparent Arabic (DOS)
-----     720      DOS-720            Arabic (Transparent ASMO); Arabic (DOS)
-----     737      ibm737             OEM Greek (formerly 437G); Greek (DOS)
-----     775      ibm775             OEM Baltic; Baltic (DOS)
cp850     850      ibm850             OEM Multilingual Latin 1 (DOS)
cp852     852      ibm852             OEM Latin 2; Central European (DOS)
-----     855      IBM855             OEM Cyrillic (primarily Russian)
-----     857      ibm857             OEM Turkish; Turkish (DOS)
[cp850]   858      IBM00858           OEM Multilingual Latin 1 + Euro symbol
-----     860      IBM860             OEM Portuguese; Portuguese (DOS)
-----     861      ibm861             OEM Icelandic; Icelandic (DOS)
-----     862      DOS-862            OEM Hebrew; Hebrew (DOS)
-----     863      IBM863             OEM French Canadian; French Canadian (DOS)
-----     864      IBM864             OEM Arabic; Arabic (864)
-----     865      IBM865             OEM Nordic; Nordic (DOS)
cp866     866      cp866              OEM Russian; Cyrillic (DOS)
-----     869      ibm869             OEM Modern Greek; Greek, Modern (DOS)
[tis620]  874      windows-874        ANSI/OEM Thai (same as 28605, ISO 8859-15)
cp932     932      shift_jis          ANSI/OEM Japanese; Japanese (Shift-JIS)
[gbk]     936      gbk                ANSI/OEM Simplified Chinese(PRC,Singapore)
[euckr]   949      ks_c_5601-1987     ANSI/OEM Korean (Unified Hangul Code)
big5      950      big5               ANSI/OEM Traditional Chinese
{utf16le} 1200     utf-16             Unicode UTF-16, little-endian byte order
{utf16}   1201     unicodeFFFE        Unicode UTF-16, big-endian byte order.
latin2    1250     windows-1250       ANSI Central European
cp1251    1251     windows-1251       ANSI Cyrillic; Cyrillic (Windows)
latin1    1252     windows-1252       ANSI Latin 1; Western European (Windows)
greek     1253     windows-1253       ANSI Greek; Greek (Windows)
latin5    1254     windows-1254       ANSI Turkish; Turkish (Windows)
[hebrew]  1255     windows-1255       ANSI Hebrew; Hebrew (Windows)
cp1256    1256     windows-1256       ANSI Arabic; Arabic (Windows)
cp1257    1257     windows-1257       ANSI Baltic; Baltic (Windows)
-----     1258     windows-1258       ANSI/OEM Vietnamese; Vietnamese (Windows)
-----     1361     Johab              Korean (Johab)
macroman  10000    macintosh          MAC Roman; Western European (Mac)
[sjis]    10001    x-mac-japanese     Japanese (Mac)
[big5]    10002    x-mac-chinesetrad  MAC Traditional Chinese (Big5)
-----     10003    x-mac-korean       Korean (Mac)
-----     10004    x-mac-arabic       Arabic (Mac)
-----     10005    x-mac-hebrew       Hebrew (Mac)
-----     10006    x-mac-greek        Greek (Mac)
-----     10007    x-mac-cyrillic     Cyrillic (Mac)
[gb2312]  10008    x-mac-chinesesimp  MAC Simplified Chinese (GB 2312)
-----     10010    x-mac-romanian     Romanian (Mac)
-----     10017    x-mac-ukrainian    Ukrainian (Mac)
[tis620]  10021    x-mac-thai         Thai (Mac)
macce     10029    x-mac-ce           MAC Latin 2; Central European (Mac)
-----     10079    x-mac-icelandic    Icelandic (Mac)
-----     10081    x-mac-turkish      Turkish (Mac)
-----     10082    x-mac-croatian     Croatian (Mac)
-----     12000    utf-32             Unicode UTF-32, little-endian byte order
{utf32}   12001    utf-32BE           Unicode UTF-32, big-endian byte order
-----     20000    x-Chinese_CNS      CNS Taiwan; Chinese Traditional (CNS)
-----     20001    x-cp20001          TCA Taiwan
-----     20002    x_Chinese-Eten     Eten Taiwan; Chinese Traditional (Eten)
-----     20003    x-cp20003          IBM5550 Taiwan
-----     20004    x-cp20004          TeleText Taiwan
-----     20005    x-cp20005          Wang Taiwan
-----     20105    x-IA5              IA5 (IRV International Alphabet No.5 7bit
-----     20106    x-IA5-German       IA5 German (7-bit)
swe7      20107    x-IA5-Swedish      IA5 Swedish (7-bit)
-----     20108    x-IA5-Norwegian    IA5 Norwegian (7-bit)
ascii     20127    us-ascii           US-ASCII (7-bit)
-----     20261    x-cp20261          T.61
-----     20269    x-cp20269          ISO 6937 Non-Spacing Accent
koi8r     20866    koi8-r             Russian (KOI8-R); Cyrillic (KOI8-R)
ujis      20932    EUC-JP             Japanese (JIS 0208-1990 and 0121-1990)
[gb2312]  20936    x-cp20936          Simplified Chinese (GB2312)
[euckr]   20949    x-cp20949          Korean Wansung
koi8u     21866    koi8-u             Ukrainian (KOI8-U)
[latin1]  28591    iso-8859-1         ISO 8859-1 Latin 1; Western European (ISO)
latin2    28592    iso-8859-2         ISO 8859-2 Central European
-----     28593    iso-8859-3         ISO 8859-3 Latin 3
-----     28594    iso-8859-4         ISO 8859-4 Baltic
-----     28595    iso-8859-5         ISO 8859-5 Cyrillic
-----     28596    iso-8859-6         ISO 8859-6 Arabic
greek     28597    iso-8859-7         ISO 8859-7 Greek
hebrew    28598    iso-8859-8         ISO 8859-8 Hebrew; Hebrew (ISO-Visual)
latin5    28599    iso-8859-9         ISO 8859-9 Turkish
latin7    28603    iso-8859-13        ISO 8859-13 Estonian
(latin9)  28605    iso-8859-15        ISO 8859-15 Latin 9
-----     29001    x-Europa           Europa 3
hebrew    38598    iso-8859-8-i       ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
-----     50220    iso-2022-jp        ISO 2022 Japanese, no halfwidth Katakana
-----     50221    csISO2022JP        ISO 2022 Japanese, with halfwidth Katakana
-----     50222    iso-2022-jp        ISO 2022 Japanese JIS X 0201-1989
-----     50225    iso-2022-kr        ISO 2022 Korean
-----     50227    x-cp50227          ISO 2022 Simplified Chinese
-----     50229    -                  ISO 2022 Traditional Chinese
ujis      51932    euc-jp             EUC Japanese
gb2312    51936    EUC-CN             EUC Simplified Chinese (EUC)
euckr     51949    euc-kr             EUC Korean
big5      51950    -                  EUC Traditional Chinese
-----     52936    hz-gb-2312         HZ-GB2312 Simplified Chinese (HZ) (7-bit)
(gb18030) 54936    GB18030            XP and later: Simplified Chinese (4 byte)
----      57002    x-iscii-de         ISCII Devanagari
----      57003    x-iscii-be         ISCII Bengali
----      57004    x-iscii-ta         ISCII Tamil
----      57005    x-iscii-te         ISCII Telugu
----      57006    x-iscii-as         ISCII Assamese
----      57007    x-iscii-or         ISCII Oriya
----      57008    x-iscii-ka         ISCII Kannada
----      57009    x-iscii-ma         ISCII Malayalam
----      57010    x-iscii-gu         ISCII Gujarati
----      57011    x-iscii-pa         ISCII Punjabi
----      65000    utf-7              Unicode (UTF-7)
utf8      65001    utf-8              Unicode (UTF-8)



Notes:

- MySQL [cp850] differs from Windows cp437 for 43 characters
  in the range 0x80..0xFF.

- MySQL [cp850] is compatible with Windows cp858 except 
  that cp858 replaces Turkish letter "DOTLESS SMALL I" at
  position 0xD5 with "EURO SIGN".

- MySQL [latin5] is compatible with cp1254, except for the 0x80-0x9F range.
  0x80-0x9F is not assigned in latin5
  0x80-0x9F has some punctuation characters in cp1254

  ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00850.txt
  ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00858.txt

- MySQL [hebrew] (also known as iso-8859-8) is compatible with Windows cp1255
  in the alphabetic range 0xE0-0xFA.
  Windows cp1255 has additionally:
  * general purpose punctuation  characters in the range 0x80-0x9F
  * Hebrew punctuation characters in the range 0xC0-0xDF.

- MySQL [tis620] is compatible with Windows cp874 in the alphanumeric range
  0xA1-0xFF.
  * tis-620 maps the range 0x80-0x9F to control characters U0080-U009F
  * cp874 uses 0x80-0x9F for some punctuation characters (9),
    and for EURO SIGN.

- MySQL [euckr] is a subset of cp949
  CP949 additionally encodes:

  * 0x80    0x0080  #
  * 0xA2E6  0x20AC  #EURO SIGN
  * 0xA2E7  0x00AE  #REGISTERED SIGN
  * 0xFF    0xF8F7  #
  * 0xC9A1..0xC9FE mapped to  U+E000..E05D (94 characters)
  * 0xFEA0..0xFEFE mapped to  U+E05E..E0BB (94 characters)

- MySQL [euckr] is close to "Code Page 20949 Korean Wansung"
  * CP20949 maps 0x80-0x9F to  U+0x0080-U+0x009F (unused in euckr)
  * euckr additionally uses [81-A0][41-FE] and [A1-FE][41-A0] (unused in 20949)

- MySQL [latin1] is a superset of Windows
  Code Page 28591 (which is iso-8859-1)

- MySQL [tis620] is close to x-mac-thai
  * tis-620 maps 0x80-x9F to control characters U0080-U009F
  * x-mac-thai uses 0x80-x9F for some additional 
    Apple-specific Thai characters from Unicode private use area.

- MySQL [sjis] is close to x-mac-japanese (Code Page 10001)
  * Difference is in 7 punctuation characters
  * Mac-Japanese has two additional blocks 
    0x8540-0x886D and 0xEB41-0xED96 with rare
    digit forms, circled forms, parenthesized forms,
    vertical forms, presentation forms.

- MySQL [big5] is close to x-mac-chinesetrad (Code Page 10002)
  Difference:
  * Apple introduces 7 additional single byte characters
  * Apple maps 0x83..0x9F - undefined, map to U0083-U009F
  * Different mapping of 26 characters in the range 0xA???
  * Big5 encodes 249 additional characters in the range 0xC???
    (KATAKANA, HIRAGANA, CYRILLIC, CIRCLED and PARENTHESIZED DIGITs)

- MySQL [gb2312] is close to x-mac-chinesesimp (Code Page 10008)
  Difference:
  * Apple introduces 7 additional single byte characters
  * Apple maps 0x83..0x9F - undefined, map to U0083-U009F
  * Difference in 8 characters in 0xA???
  * Apple has 29 additional characters (vertical forms) in 0xA6??
  * Apple has 6 additional characters (pinyin) in 0xA8??

- MySQL [gb2312] is close to x-cp20936 (Code Page 20936)
  Difference:
  * gb2312 uses 0xA1AC for "0x2016 #DOUBLE VERTICAL LINE"
  * x-cp20936 uses 0x80 for "0x0080  #"
  * x-co20936 maps 41 single byte characters 
     0x81..0x9F, 0xA0, 0xF8-FF to  range 0xF8D8..0xF8FF.

- MySQL [gbk] is close to Code Page 936.
  Difference:
  cp936 encodes 0x80 as 0x20AC  #EURO SIGN
  cp936 encodes 0xFF as 0xF8F5  #
  cp936 has additional 2149 characters in the range [A1..FE]??


Mapping between Unix locale charmaps and MySQL character sets
=============================================================


MySQL     Unix
-----     ----
cp1251    ansi1251
armscii8  armscii8
ascii     ASCII
big5      Big5
-----     big5hkscs
cp1251    cp1251
cp1255    cp1255
cp866     CP866
gb2312    eucCN
ujis      eucJP
euckr     eucKR
-----     euctw
(gb18030) gb18030
gb2312    gb2312
gbk       gbk
geostd8   georgianps
cp1252    IBM-1252
-----     ISCII-DEV
[latin1]  iso88591
[latin1]  ISO_8859-1
[latin1]  ISO8859-1
[latin1]  ISO-8859-1
-----     iso885910
latin7    iso885913
latin7    ISO_8859-13
latin7    ISO8859-13
latin7    ISO-8859-13
-----     iso885914
(latin9)  iso885915
(latin9)  ISO_8859-15
(latin9)  ISO8859-15
(latin9)  ISO-8859-15
latin2    iso88592
latin2    ISO_8859-2
latin2    ISO8859-2
latin2    ISO-8859-2
-----     iso88593
-----     ISO_8859-4
-----     ISO8859-4
-----     iso88595
-----     ISO_8859-5
-----     ISO8859-5
-----     iso88596
greek     iso88597
greek     ISO_8859-7
greek     ISO8859-7
greek     ISO-8859-7
hebrew    iso88598
hebrew    ISO_8859-8
hebrew    ISO8859-8
hebrew    ISO-8859-8
latin5    iso88599
latin5    ISO_8859-9
latin5    ISO8859-9
latin5    ISO-8859-9
koi8r     koi8r
koi8r     KOI8-R
-----     koi8t
koi8u     koi8u
koi8u     KOI8-U
-----     pt154
sjis      Shift_JIS
sjis      SJIS
-----     tcvn
tis620    tis620
ujis      ujis
ascii     US-ASCII
utf8      utf8


Note: mapping should be done case insensitively.
For example:

 sjis == SJIS
 eucJP == EUCJP == eucjp



References
==========
- GetLocaleInfo:
  http://msdn.microsoft.com/en-us/library/ms960719.aspx 
- LCTYPE Constants
  http://msdn.microsoft.com/en-us/library/ms957134.aspx
- Windows code pages
  http://www.microsoft.com/globaldev/reference/WinCP.mspx
  http://www.microsoft.com/globaldev/reference/oem.mspx
- Mac encodings
  http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/
  http://www.opensource.apple.com/darwinsource/WWDC2004/tcl-14/tcl/tools/encoding/

- Swedish 7-bit Alphabet
  http://en.wikipedia.org/wiki/ISO/IEC_646

- Coded character sets and Encodings (Debian)
  http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html

- Discussion about Win API code page functions:

http://lists.gnupg.org/pipermail/gnupg-users/2005-July/026189.html