WL#3997: New euckr characters
Affects: Server-5.1 — Status: Complete — Priority: Low
Add these characters to euckr:
  Euro Sign
  Registered Sign
Perhaps also add this character:
  Circled Hangul Ieung U (i.e. "Postal code mark")
MySQL's euckr character set is ultimately based on the Korean Standard KS C 5601-1992. But the Korean authorities have added three new characters since then. All three characters are in the modern 'euckr' (as seen in iconv); two of the three are definitely in cp949 (the Microsoft variant of euckr).

euckr   ucs2 encoding  KS version  name
------  -------------  ----------  ----
0xa2e6  U+20AC         1998        EURO SIGN
0xa2e7  U+00AE         1998        REGISTERED SIGN
0xa2e8  U+327E         2002        CIRCLED HANGUL IEUNG U

These characters are not in the KS C 5601 Unicode mapping table
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
But EURO SIGN and REGISTERED SIGN are in the cp949 Unicode mapping table
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT

And MySQL has always supported the 8822 pre-combined hangul (syllabic block) characters which are not in strict KS C 5601 (they are only in an Annex) but are in cp949. In other words, MySQL really supports cp949. Let's make it complete by adding EURO SIGN and REGISTERED SIGN.

The third character, CIRCLED HANGUL IEUNG U, informally known as "Korean Postal Code Mark", is also in the Korean Standard now. But it is not listed in the cp949 Unicode mapping table. So we did more research and decided not to add CIRCLED HANGUL IEUNG U.
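The decision above can be pictured as a small overlay on top of the existing cp949 mapping. The following Python sketch is only an illustration: the names CANDIDATES, ACCEPTED, euckr_to_unicode and describe are hypothetical and are not part of MySQL. Only the code values, names and years come from the table above.

```python
# Hypothetical model of the decision above; this is not MySQL source code.
# Code values, names and years are taken from the table in this worklog.
CANDIDATES = {
    0xA2E6: (0x20AC, "EURO SIGN", 1998),
    0xA2E7: (0x00AE, "REGISTERED SIGN", 1998),
    0xA2E8: (0x327E, "CIRCLED HANGUL IEUNG U", 2002),
}

# Only the two characters that are also in the cp949 mapping table are added.
ACCEPTED = {0xA2E6, 0xA2E7}


def euckr_to_unicode(code):
    """Return the Unicode code point for an accepted new euckr code,
    or None when the code is not one of the accepted additions."""
    if code in ACCEPTED:
        return CANDIDATES[code][0]
    return None


def describe(code):
    """Return a one-line summary of a candidate character and its status."""
    ucs, name, year = CANDIDATES[code]
    status = "add" if code in ACCEPTED else "do not add"
    return "0x%04x U+%04X KS %d %s: %s" % (code, ucs, year, name, status)


if __name__ == "__main__":
    for code in sorted(CANDIDATES):
        print(describe(code))
```

Keeping the rejected character in the candidate table, but outside the accepted set, mirrors how the worklog documents 0xa2e8 while declining to map it.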
Research for CIRCLED HANGUL IEUNG U
-----------------------------------

This says that Korean postal code mark = circled hangul ieung u:
std.dkuug.dk/jtc1/sc2/wg2/docs/N2815.doc

This says that circled hangul ieung u is in BMP, as U+327e:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

This says that U+327E is 0xa2e8 in cp949 now:
http://sourceware.org/ml/glibc-bugs/2007-02/msg00003.html

This says that 0xa2e8 isn't in cp949:
http://www.microsoft.com/globaldev/reference/dbcs/949/949_A2.mspx

This says it was added to Korean Standard in 2002:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4891024

This says that HP supports the 2002 Korean Standard:
http://docs.hp.com/en/5991-6469/ch11s04.html

This says that iconv shouldn't support it:
http://sources.redhat.com/ml/glibc-bugs/2007-02/msg00012.html

But 0xa2e8 is not in the cp949-to-Unicode mapping table at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT

So we wanted to find out if the mapping table is obsolete and the character is in cp949. This can be done by
(a) asking Koreans who will understand the question, e.g. our Korean partner who runs mysqlkorea.co.kr
(b) testing with Vista + Korean to see if entering 0xa2e8 in cp949 results in a character that looks like the illustration in std.dkuug.dk/jtc1/sc2/wg2/docs/N2815.doc

Bar asked for help on #support. Bogdan, Tonci and MeijiK kindly looked at UTF8 and EUC-KR encoded HTML pages containing "CIRCLED HANGUL IEUNG U" using Western Vista, Japanese Vista, Win2k8. All of these operating systems displayed the UTF8 version correctly, with a character like this:
http://www.fileformat.info/info/unicode/char/327E/circled_hangul_ieung.png
None of them displayed this character in the EUC-KR version.

So the latest versions of Microsoft platforms do generally support "CIRCLED HANGUL IEUNG U", but not in cp949.
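The browser check described above could be repeated with two minimal HTML pages, one per encoding. The Python sketch below is a guess at what such pages might look like; the original test pages are not recorded in this worklog, and build_test_page, write_pages and the file names are hypothetical. The EUC-KR page is assembled from the raw bytes 0xA2 0xE8 instead of going through a codec, because whether a given converter maps that code is exactly the open question.

```python
import os

# Hypothetical reconstruction of the kind of test pages described above.
CHAR_UTF8 = "\u327e".encode("utf-8")   # CIRCLED HANGUL IEUNG U as UTF-8 bytes
CHAR_EUCKR = bytes([0xA2, 0xE8])       # the disputed euckr/cp949 code, raw


def build_test_page(charset_label, char_bytes):
    """Return the bytes of a minimal HTML page that declares charset_label
    and contains char_bytes as its only visible content."""
    head = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=%s"></head><body><p>' % charset_label
    ).encode("ascii")
    tail = b"</p></body></html>"
    return head + char_bytes + tail


def write_pages(directory):
    """Write both test pages into directory and return their paths."""
    pages = {
        "ieung_u_utf8.html": build_test_page("UTF-8", CHAR_UTF8),
        "ieung_u_euckr.html": build_test_page("EUC-KR", CHAR_EUCKR),
    }
    paths = []
    for filename, content in pages.items():
        path = os.path.join(directory, filename)
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)
    return paths
```

Calling write_pages(".") and opening both files in a browser shows whether that platform renders the character from each encoding.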
Other mappings around
---------------------

icu-3.0 additionally maps the following single byte characters in cp949 mapping for Windows-2000 (but not in cp949 mapping for XP):

cp949  ucs2    name
-----  ------  ----------
0x80   0x0080  #<control>
0xFF   0xF8F7  #<Private Use>

Also, icu-3.0, libiconv-1.12, jdk-1.1.5 map the following double byte cp949 characters:

0xC9A1..0xC9FE mapped to <Private Use> U+E000..E05D (94 characters)
0xFEA1..0xFEFE mapped to <Private Use> U+E05E..E0BB (94 characters)

It is unknown though if Microsoft does this mapping itself. These mapping rules are not mentioned here:
http://www.microsoft.com/globaldev/reference/dbcs/949.mspx

We won't support these additional rules.

Changes In Code
---------------

Add the two characters.

It is already possible to say

CREATE TABLE tk (s1 VARCHAR(1) CHARACTER SET euckr);
INSERT INTO tk VALUES (0xa2e6),(0xa2e7),(0xa2e8);

without error. So just make sure code comments are right.

Make sure conversion works to other character sets which include EURO SIGN and REGISTERED SIGN.

For euckr_korean_ci collation, follow current rules:
- case insensitive for basic Latin letters
- using binary code for multi-byte characters.

So the new characters fit here:

...
0xA2E5 #TELEPHONE SIGN
0xA2E6 #EURO SIGN
0xA2E7 #REGISTERED SIGN
0xA2E8 #CIRCLED HANGUL IEUNG U /* But we won't do this one */
0xA341 #HANGUL SYLLABLE CIEUC YU NIEUNCIEUC
...

Changes in Documentation
------------------------

The CJK FAQ in the MySQL Reference Manual says:

"
29.11.8: Of what issues should I be aware when working with Korean character sets in MySQL?

In theory, while there have been several versions of the euckr (Extended Unix Code Korea) character set, only one problem has been noted. We use the "ASCII" variant of EUC-KR, in which the code point 0x5c is REVERSE SOLIDUS, that is \, instead of the “KS-Roman” variant of EUC-KR, in which the code point 0x5c is WON SIGN(₩).
This means that you cannot convert Unicode U+20A9 to euckr:

mysql> SELECT
    -> CONVERT(_ucs2 0x20a9 USING euckr) AS euckr,
    -> HEX(CONVERT(_ucs2 0x20a9 USING euckr)) AS hexeuckr;
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
1 row in set (0.00 sec)

MySQL's graphic Korean chart is here:
http://www.collation-charts.org/mysql60/mysql604.euckr_korean_ci.html
"

Possibly we should say, in the newer version of the manual:

"
29.11.8: Of what issues should I be aware when working with Korean character sets in MySQL?

There have been several versions of the euckr (Extended Unix Code Korea) character set. We try to follow the Korean Standard (KS 5601) but also take the Microsoft Korean code page (cp949) into account. We are conservative about adding new characters which may not yet be supported by other products that MySQL must work with. These are the affected characters.

0x5c. We use the "ASCII" variant of EUC-KR, in which the code point 0x5c is REVERSE SOLIDUS, that is \, instead of the “KS-Roman” variant of EUC-KR, in which the code point 0x5c is WON SIGN(₩). This means that you cannot convert Unicode U+20A9 to euckr:

mysql> SELECT
    -> CONVERT('₩' USING euckr) AS euckr,
    -> HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
1 row in set (0.00 sec)

0xa2e6, 0xa2e7. The "euro sign" (euckr encoding = 0xa2e6, Unicode equivalent = U+20AC) and the "registered sign" (euckr encoding = 0xa2e7, Unicode equivalent = U+00AE) were added to the Korean Standard after 1992, and fully supported in MySQL since version [fill in version number here]. These characters are also in cp949.

0xa2e8. The "Circled Hangul Ieung U", informally known as the "Korean Postal Code Mark" (euckr encoding = 0xa2e8, Unicode equivalent = U+327E) was added to the Korean Standard for 2002, but is not supported in MySQL. This character is not in cp949.
See also http://sources.redhat.com/ml/glibc-bugs/2007-02/msg00012.html

Additional Johab. The KS C 5601-1992 standard has 8224 characters, including 2350 precombined hangul (syllable block) characters. An appendix to the standard mentions an additional 8822 characters but does not include them. Microsoft cp949 includes them. MySQL recognizes them. So, in this respect, MySQL's 'euckr' is actually more like 'cp949' than like the KS C 5601 requirement.

The euckr-to-unicode mapping table that MySQL follows is the one for cp949
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
plus the two new characters

0xA2E6 U+20AC #EURO SIGN
0xA2E7 U+00AE #REGISTERED SIGN
0xA2E8 U+327E #CIRCLED HANGUL IEUNG U /* no */

MySQL's graphic Korean chart is here:
http://www.collation-charts.org/mysql60/mysql604.euckr_korean_ci.html
"

Change graphic Korean chart
---------------------------

Since the MySQL documentation mentions the chart at http://www.collation-charts.org/mysql60/mysql604.euckr_korean_ci.html, it must be updated for the new characters.

Connectors
----------

Originally this task said:
"According to this 2003 bug report http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4891024 the new characters are not supported yet for the JDK. Surely this is obsolete and doesn't affect Connector/J anyway, but we need acknowledgment from Connector/J people before proceeding."

It is now clear that this is not necessary, since the characters are already okay for input. Unicode conversion is not a JDK matter.

References
----------

WL#3332 Korean Enhancements
BUG#8940 Can't insert character ® into table
Customer Support Issue is noted in progress report
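
Collation order sketch
----------------------

The two euckr_korean_ci rules listed under Changes In Code (case insensitive for basic Latin letters, binary code for multi-byte characters) can be modelled roughly as a sort-key function. This Python sketch is a simplified illustration, not MySQL's implementation: euckr_ci_sort_key is a hypothetical name, folding to uppercase rather than lowercase is an assumption, and every byte at or above 0x80 is simply treated as the lead byte of a two-byte character.

```python
# Hypothetical sketch of the euckr_korean_ci ordering rules; not MySQL source.
def euckr_ci_sort_key(data):
    """Build a sort key (list of integer weights) from euckr-encoded bytes.

    Bytes below 0x80: basic Latin lowercase letters are folded to uppercase
    (an assumption), so comparison ignores case.
    Bytes at or above 0x80 followed by another byte: the pair is weighted
    by its two-byte binary code.
    """
    key = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >= 0x80 and i + 1 < len(data):
            key.append((b << 8) | data[i + 1])   # two-byte binary code
            i += 2
        else:
            if 0x61 <= b <= 0x7A:                # 'a'..'z'
                b -= 0x20                        # fold to 'A'..'Z'
            key.append(b)
            i += 1
    return key


if __name__ == "__main__":
    codes = [b"\xa3\x41", b"\xa2\xe7", b"\xa2\xe5", b"\xa2\xe6"]
    for code in sorted(codes, key=euckr_ci_sort_key):
        print(code.hex())
```

Under this model the two new characters sort between 0xA2E5 and 0xA341, which matches the position shown in the Changes In Code listing.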
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.