WL#3997: New euckr characters

Affects: Server-5.1 — Status: Complete

Description
High Level Architecture

Add these characters to euckr:
Euro Sign
Registered Sign
Perhaps also add this character:
Circled Hangul Ieung U (i.e. "Postal code mark")

MySQL's euckr character set is ultimately based on
the Korean Standard KS C 5601-1992. But the Korean
authorities have added three new characters since
then. All three characters are in the modern 'euckr'
(as seen in iconv); two of the three are definitely
in cp949 (the Microsoft variant of euckr).

euckr  ucs2 encoding KS version name
------ ------------- ---------- ----
0xa2e6 U+20AC        1998       EURO SIGN
0xa2e7 U+00AE        1998       REGISTERED SIGN
0xa2e8 U+327E        2002       CIRCLED HANGUL IEUNG U

These characters are not in the KS C 5601 Unicode mapping table
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
But EURO SIGN and REGISTERED SIGN are in the cp949 Unicode mapping table
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
And MySQL has always supported the 8822 pre-combined hangul
(syllabic block) characters which are not in strict
KS C 5601 (they are only in an Annex) but are in cp949.
In other words, MySQL really supports cp949, so let's make it complete
by adding EURO SIGN and REGISTERED SIGN.
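
The figure 8822 is consistent with simple arithmetic: Unicode defines
11172 precombined hangul syllables (U+AC00..U+D7A3), and the main body of
KS C 5601-1992 covers 2350 of them. A minimal check, with constant names
chosen here only for illustration:

```python
# Precombined hangul syllables occupy U+AC00..U+D7A3 in Unicode.
HANGUL_FIRST = 0xAC00
HANGUL_LAST = 0xD7A3

# 19 initial consonants x 21 vowels x 28 finals (27 finals plus "no final").
total_syllables = 19 * 21 * 28
assert total_syllables == HANGUL_LAST - HANGUL_FIRST + 1

# Syllables in the main body of KS C 5601-1992.
KSC5601_SYLLABLES = 2350

# Syllables that are in cp949 but not in strict KS C 5601.
extra_syllables = total_syllables - KSC5601_SYLLABLES
print(total_syllables, extra_syllables)
```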

The third character, CIRCLED HANGUL IEUNG U, informally known as
"Korean Postal Code Mark", is also in the Korean Standard now.
But it is not listed in the cp949 Unicode mapping table.
So we did more research and decided not to add CIRCLED HANGUL IEUNG U.

Research for CIRCLED HANGUL IEUNG U
-----------------------------------

This says that Korean postal code mark = circled hangul ieung u:
std.dkuug.dk/jtc1/sc2/wg2/docs/N2815.doc
This says that circled hangul ieung u is in BMP, as U+327e:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
This says that U+327E is 0xa2e8 in cp949 now:
http://sourceware.org/ml/glibc-bugs/2007-02/msg00003.html
This says that 0xa2e8 isn't in cp949:
http://www.microsoft.com/globaldev/reference/dbcs/949/949_A2.mspx
This says it was added to Korean Standard in 2002:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4891024
This says that HP supports the 2002 Korean Standard:
http://docs.hp.com/en/5991-6469/ch11s04.html
This says that iconv shouldn't support it:
http://sources.redhat.com/ml/glibc-bugs/2007-02/msg00012.html

But 0xa2e8 is not in the cp949-to-Unicode mapping table at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT

So we wanted to find out if the mapping table is obsolete and the character
is in cp949. This can be done by
(a) asking Koreans who will understand the question, e.g.
    our Korean partner who runs mysqlkorea.co.kr
(b) testing with Vista + Korean to see if entering 0xa2e8 in cp949
    results in a character that looks like the illustration in
    std.dkuug.dk/jtc1/sc2/wg2/docs/N2815.doc

Bar asked for help on #support.
Bogdan, Tonci and MeijiK kindly had a look into UTF8 and EUC-KR
encoded HTML pages containing "CIRCLED HANGUL IEUNG U" using
Western Vista, Japanese Vista, Win2k8. All of these OSes displayed
the UTF8 version correctly, with a character like this:
http://www.fileformat.info/info/unicode/char/327E/circled_hangul_ieung.png
None of them displayed this character in the EUC-KR version.

So the latest versions of Microsoft platforms do generally
support "CIRCLED HANGUL IEUNG U", but not in cp949.


Other mappings around
---------------------
icu-3.0 additionally maps the following single-byte characters
in its cp949 mapping for Windows-2000 (but not in its cp949 mapping for XP):

cp949  ucs2    name
-----  ------  ----------
0x80   0x0080  #
0xFF   0xF8F7  #

Also, icu-3.0, libiconv-1.12, jdk-1.1.5 map the following
double byte cp949 characters:

0xC9A1..0xC9FE mapped to  U+E000..E05D (94 characters)
0xFEA1..0xFEFE mapped to  U+E05E..E0BB (94 characters)

It is not known, though, whether Microsoft itself does this mapping.
These mapping rules are not mentioned here:
http://www.microsoft.com/globaldev/reference/dbcs/949.mspx
We won't support these additional rules.
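
For reference, the rule that those libraries apply (and that MySQL will
not) amounts to a linear mapping from two 94-character rows into the
Private Use Area. This is a minimal sketch with a hypothetical function
name. It assumes each row runs over trail bytes 0xA1..0xFE, which is what
gives exactly 94 characters per row and matches the U+E05E..E0BB range:

```python
ROW_SIZE = 94                  # trail bytes 0xA1..0xFE
PUA_BASE = 0xE000              # first Private Use Area code point
UDC_LEAD_BYTES = (0xC9, 0xFE)  # lead bytes of the two user-defined rows


def cp949_udc_to_pua(code):
    """Return the PUA code point for a user-defined cp949 code, else None."""
    lead, trail = code >> 8, code & 0xFF
    if lead not in UDC_LEAD_BYTES or not 0xA1 <= trail <= 0xFE:
        return None
    row = UDC_LEAD_BYTES.index(lead)
    return PUA_BASE + row * ROW_SIZE + (trail - 0xA1)


for code in (0xC9A1, 0xC9FE, 0xFEA1, 0xFEFE, 0xA2E6):
    pua = cp949_udc_to_pua(code)
    print(hex(code), "->", "unmapped" if pua is None else f"U+{pua:04X}")
```

The last value, 0xA2E6, lies outside both rows, so the sketch reports it
as unmapped.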


Changes In Code
---------------

Add the two characters. It is already possible to say
CREATE TABLE tk (s1 VARCHAR(1) CHARACTER SET euckr);
INSERT INTO tk VALUES (0xa2e6),(0xa2e7),(0xa2e8);
without error. So just make sure code comments are right.

Make sure conversion works to other character sets which include
EURO SIGN and REGISTERED SIGN.
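
What "conversion works" means can be sketched outside the server. The
table below holds only the two new mappings from this worklog. The helper
name and the choice of target encodings are illustrative, and Python's
built-in codecs are used purely as stand-ins that may differ from MySQL's
own conversion tables:

```python
# The two mappings added by this worklog (euckr code -> Unicode code point).
NEW_EUCKR_TO_UNICODE = {
    0xA2E6: 0x20AC,  # EURO SIGN
    0xA2E7: 0x00AE,  # REGISTERED SIGN
}


def convert_from_euckr(code, target_encoding):
    """Convert one new euckr code to bytes in target_encoding via Unicode.

    Codes outside the table, and characters that the target encoding
    lacks, come back as b"?".
    """
    code_point = NEW_EUCKR_TO_UNICODE.get(code)
    if code_point is None:
        return b"?"
    return chr(code_point).encode(target_encoding, errors="replace")


for code in (0xA2E6, 0xA2E7, 0xA2E8):
    for encoding in ("utf-8", "latin-1", "cp1252"):
        converted = convert_from_euckr(code, encoding)
        print(hex(code), encoding, converted.hex())
```

A target that includes the character yields its real encoding, while a
target without it (EURO SIGN in ISO 8859-1, for example) yields 3f, the
question mark.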

For the euckr_korean_ci collation, follow the current rules:
- case insensitive for basic Latin letters
- binary code order for multi-byte characters.
So the new characters fit here:
...
0xA2E5	#TELEPHONE SIGN
0xA2E6	#EURO SIGN
0xA2E7	#REGISTERED SIGN
0xA2E8  #CIRCLED HANGUL IEUNG U             /* But we won't do this one */
0xA341	#HANGUL SYLLABLE CIEUC YU NIEUNCIEUC
...
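
The two rules can be sketched as a sort-key function. This is only an
illustration of the ordering described above, not MySQL's actual
collation code. The function name is hypothetical, and treating every
byte >= 0x81 as a lead byte is a simplifying assumption:

```python
def euckr_ci_sort_key(data):
    """Return a tuple of weights for a euckr-encoded byte string."""
    weights = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte >= 0x81 and i + 1 < len(data):
            # Multi-byte character: the weight is the binary two-byte code.
            weights.append((byte << 8) | data[i + 1])
            i += 2
        else:
            # Single byte: fold basic Latin lowercase to uppercase.
            if 0x61 <= byte <= 0x7A:
                byte -= 0x20
            weights.append(byte)
            i += 1
    return tuple(weights)


chars = [bytes.fromhex(h) for h in ("a341", "a2e7", "a2e5", "a2e6")]
ordered = sorted(chars, key=euckr_ci_sort_key)
print([c.hex() for c in ordered])
print(euckr_ci_sort_key(b"abc") == euckr_ci_sort_key(b"ABC"))
```

Sorting by this key places 0xA2E6 and 0xA2E7 between 0xA2E5 and 0xA341,
as in the list above, and treats "abc" and "ABC" as equal.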

Changes in Documentation
------------------------

The CJK FAQ in the MySQL Reference Manual says:
"
29.11.8:  Of what issues should I be aware when working with Korean character
sets in MySQL?

In theory, while there have been several versions of the euckr
(Extended Unix Code Korea) character set, only one problem has been noted.
We use the "ASCII" variant of EUC-KR, in which the code point 0x5c is
REVERSE SOLIDUS, that is \, instead of the “KS-Roman” variant of EUC-KR,
in which the code point 0x5c is WON SIGN(₩). This means that you cannot
convert Unicode U+20A9 to euckr:
mysql> SELECT
    ->     CONVERT(_ucs2 0x20a9 USING euckr) AS euckr,
    ->     HEX(CONVERT(_ucs2 0x20a9 USING euckr)) AS hexeuckr;
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
1 row in set (0.00 sec)
MySQL's graphic Korean chart is here:
http://www.collation-charts.org/mysql60/mysql604.euckr_korean_ci.html
"

Possibly we should say, in the newer version of the manual:
"
29.11.8:  Of what issues should I be aware when working with Korean character
sets in MySQL?

There have been several versions of the euckr (Extended Unix Code Korea)
character set. We try to follow the Korean Standard (KS C 5601) but also take
the Microsoft Korean code page (cp949) into account. We are conservative
about adding new characters which may not yet be supported by other products
that MySQL must work with. These are the affected characters.

0x5c. We use the "ASCII" variant of EUC-KR, in which the code point 0x5c is
REVERSE SOLIDUS, that is \, instead of the “KS-Roman” variant of EUC-KR, in
which the code point 0x5c is WON SIGN(₩). This means that you cannot convert
Unicode U+20A9 to euckr:
mysql> SELECT
    ->     CONVERT('₩' USING euckr) AS euckr,
    ->     HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
1 row in set (0.00 sec)

0xa2e6, 0xa2e7. The "euro sign" (euckr encoding = 0xa2e6, Unicode equivalent
= U+20AC) and the "registered sign" (euckr encoding = 0xa2e7, Unicode
equivalent = U+00AE) were added to the Korean Standard after 1992, and fully
supported in MySQL since version [fill in version number here]. These
characters are also in cp949.

0xa2e8. The "Circled Hangul Ieung U", informally known as the "Korean Postal
Code Mark" (euckr encoding = 0xa2e8, Unicode equivalent = U+327E) was added
to the Korean Standard for 2002, but is not supported in MySQL.
This character is not in cp949.
See also http://sources.redhat.com/ml/glibc-bugs/2007-02/msg00012.html

Additional Johab. The KS C 5601-1992 standard has 8224 characters, including
2350 precombined hangul (syllable block) characters. An appendix to the standard
mentions an additional 8822 characters but does not include them. Microsoft
cp949 includes them. MySQL recognizes them. So, in this respect, MySQL's
'euckr' is actually more like 'cp949' than like the KS C 5601 requirement.

The euckr-to-unicode mapping table that MySQL follows is the one for cp949
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
plus the two new characters
0xA2E6  U+20AC  #EURO SIGN
0xA2E7  U+00AE  #REGISTERED SIGN
0xA2E8  U+327E  #CIRCLED HANGUL IEUNG U /* no */

MySQL's graphic Korean chart is here:
http://www.collation-charts.org/mysql60/mysql604.euckr_korean_ci.html
"

Change graphic Korean chart
---------------------------

Since the MySQL documentation mentions the chart at
http://www.collation-charts.org/mysql60/mysql604.euckr_korean_ci.html,
it must be updated for the new characters.

Connectors
----------

Originally this task said:
"According to this 2003 bug report
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4891024
the new characters are not supported yet for the JDK.
Surely this is obsolete and doesn't affect Connector/J anyway,
but we need acknowledgment from Connector/J people before proceeding."

It is now clear that this is not necessary, since the characters
are already okay for input. Unicode conversion is not a JDK matter.

References
----------

WL#3332 Korean Enhancements
BUG#8940 Can't insert character ® into table
Customer Support Issue is noted in progress report