WL#3332: Korean Enhancements

Affects: Server-5.1 — Status: Complete

Description
High Level Architecture

The High-Level Specification discusses enhancing cp949 support,
and proposes a small change in strings/ctype-euc_kr.c.

Character Sets
--------------

We support Korean with euckr and Unicode character sets
(utf8, ucs2, etc.).

Our euckr (EUC Korean) is an amalgamation of ISO 646
(ASCII) with the Korean standard KS X 1001 (also known
as KS C 5601-1992). ASCII characters take 1 byte and all
the Korean characters take 2 bytes, so euckr is a
variable-width character set, not a 2-byte character set.
Of the character sets above, only ucs2 is a fixed 2-byte
character set.
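
As a minimal sketch of why euckr is not a 2-byte character set, the
following C helpers derive character lengths from the lead byte. The
function names are hypothetical (they are not taken from
strings/ctype-euc_kr.c), and the sketch assumes that every byte below
0x80 is a 1-byte ASCII character and that every other byte starts a
2-byte character, without validating the second byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the character whose first byte
   is c. Assumption: bytes below 0x80 are ASCII (1 byte); any other
   byte starts a 2-byte character. The second byte is not validated. */
size_t euckr_char_len(unsigned char c)
{
  return c < 0x80 ? 1 : 2;
}

/* Hypothetical helper: number of characters in a buffer of len bytes.
   A truncated 2-byte character at the end still counts as one. */
size_t euckr_count_chars(const unsigned char *s, size_t len)
{
  size_t pos = 0, chars = 0;
  assert(s != NULL);
  while (pos < len)
  {
    pos += euckr_char_len(s[pos]);
    chars++;
  }
  return chars;
}
```

For the 3-byte sequence 0x41 0xB0 0xA1 (the letter A followed by one
2-byte Korean character), euckr_count_chars() returns 2, so byte
length and character length differ.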

Actually our 'euckr' is more like 'cp949'
-----------------------------------------

Although we call it 'euckr' and say it's based on KS C 5601-1992,
in fact it's very close to 'cp949' (Microsoft's Korean code page,
also known as "UHC" = Unified Hangul Code, or as "Extended
Wansung"). cp949 is a superset of euckr: it has 8822 extra
characters, all hangul (Korean syllables).

Alexander Barkov noticed that the Unicode mapping tables for
"KSC5601" and "CP949" are almost exactly the same, except that
"KSC5601" has no euro sign or registered sign:
> I run this shell script:
> 
> wget ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
> wget ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
> 
> egrep "^0x[0-9A-Z]{4}" CP949.TXT | while read a b c
> do
>    echo $a, $b
> done > cp949-map.txt
> 
> egrep "^0x[0-9A-Z]{4}" KSC5601.TXT | while read a b c
> do
>    echo $a, $b
> done > ksc5601-map.txt
> 
> diff cp949-map.txt ksc5601-map.txt
> 
> 
> and its output is:
> 
> ./compare-cp932-euckr.sh
> 6028,6029d6027
> < 0xA2E6, 0x20AC
> < 0xA2E7, 0x00AE
> 
> 
> No more research is needed.

Peter Gulutzan said, referring to an older version of this
worklog task, that the Unicode mapping table was really for
cp949. Alexander Barkov agreed:

> 1. [Referring to an older version of the worklog description]
> WL#3332 says our euckr set comes from KS C 5601-1992,
> and says that cp949 would be our euckr + 8822 precombined
> hangul. That's false. The clues to the truth are:
> * stason.org says that the old Unicode mapping tables,
>   which are supposed to be tables for euckr, were in
>   fact tables for cp949.
>   "You need to note, however, that KSC5601.TXT in Unicode
>   ftp archive and Unicode 2.0 CD-ROM is actually UHC/MS
>   Code Page 949/Windows 949(see below) to Unicode 2.0
>   mapping table instead of KS X 1001(KS C 5601-1992) to
>   Unicode mapping table as it claims to be."
>   http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html
> * Ken Lunde says KS C 5601-1992 has 8224 characters,
>   of which 2350 are precombined hangul, but "Annex 3
>   of this standard defines the complete set of 11,172
>   pre-combined hangul characters, also known as Johab."
>   ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
>   But wait: 11172 - 2350 = 8822 -- which is the number
>   of 'new' characters that are supposed to be in cp949!
> * The Unicode mapping table for euckr has 17046 entries.
>   ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
>   At first that seems odd, since Lunde (above) says there
>   are only 8224 characters.
>   But wait: 17046 - 8224 = 8822 -- which, again, is supposed
>   to be the difference between euckr and cp949.
> Conclusion: MySQL has never used strict KS C 5601-1992. It has
> used that plus the 8822 'annex 3' characters -- which means it has
> always used cp949.
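
The arithmetic in the quoted message can be checked mechanically. The
following is a minimal sketch in C. The function and constant names are
hypothetical, and the 19 x 21 x 28 decomposition of the 11,172
precombined hangul (initial consonants, vowels, and final consonants
including "no final") is a well-known fact that the message itself does
not state.

```c
#include <assert.h>

enum
{
  HANGUL_INITIALS = 19,    /* initial consonants */
  HANGUL_MEDIALS  = 21,    /* vowels */
  HANGUL_FINALS   = 28,    /* 27 final consonants plus "no final" */
  KSC5601_HANGUL  = 2350,  /* precombined hangul in KS C 5601-1992 */
  KSC5601_TOTAL   = 8224,  /* all characters in KS C 5601-1992 */
  MAPPING_ENTRIES = 17046  /* entries in KSC5601.TXT, per the message */
};

/* All precombined hangul syllables: 19 * 21 * 28. */
int all_precombined_hangul(void)
{
  return HANGUL_INITIALS * HANGUL_MEDIALS * HANGUL_FINALS;
}

/* Hangul syllables that are not in KS C 5601-1992. */
int extra_hangul(void)
{
  return all_precombined_hangul() - KSC5601_HANGUL;
}

/* Mapping-table entries beyond the KS C 5601-1992 repertoire. */
int extra_mapping_entries(void)
{
  return MAPPING_ENTRIES - KSC5601_TOTAL;
}

/* Both differences must give the same number of 'new' characters. */
void check_counts(void)
{
  assert(extra_hangul() == 8822);
  assert(extra_mapping_entries() == 8822);
}
```

Both differences come out at 8822, which matches the number of extra
hangul attributed to cp949.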

Therefore the "Proposal" which was in the earlier editions of
this worklog task -- to add cp949 -- is cancelled.

Reported problem with cp949
---------------------------

The following is from a Korean customer's report (some details are redacted):
"
And one more thing...

Euckr(ksc5601) code range :
[A1...FE] [A1...FE]

Cp949(Extended Euckr) code range :
[81...A0] [41...5A,61...7A,81...FE]
[A1...C5] [41...5A,61...7A,81...A0]
C6 [41...52]
[A1...FE] [A1...FE] -> ksc5601(=ksx1001)

I attached a image about code range of cp949(extened euckr) and ksc5601(=ksx1001).

"strings/ctype-euc_kr.c" source code already has all definition and mapping rule
between cp949 and Unicode. (except 0xA2E6, 0xA2E7)
So, MySQL "Euckr" table can store all cp949(extended euckr) character.
But, I can't save all cp949 code to euckr table.
Because iseuc_kr_head() function has a problem.

*ctype-euc_kr.c LINE 194*
#define iseuc_kr_head(c) ((0xa1<=(uchar)(c) && (uchar)(c)<=0xfe))
#define iseuc_kr_tail1(c) ((uchar) (c) >= 0x41 && (uchar) (c) <= 0x5A)
#define iseuc_kr_tail2(c) ((uchar) (c) >= 0x61 && (uchar) (c) <= 0x7A)
#define iseuc_kr_tail3(c) ((uchar) (c) >= 0x81 && (uchar) (c) <= 0xFE)

According to these function, euckr table can save characters below range.
[A1..FE][41..5A,61..7A,81..FE]
This range covers all ksc5601(ksx1001) and most of cp949(extended euckr).

To cover all cp949 & ksc5601 code, this function have to modified like..
#define iseuc_kr_head(c) ((0x81<=(uchar)(c) && (uchar)(c)<=0xfe))
// head character is from 0x81 (not 0xa1) to 0xfe

After modification, euckr table can store all cp949(extended euckr) code.
[81...FE] [41..5A,61..7A,81..FE]
"

[Change on 2009-06-23] The sections "Won sign problem", "Variant
euckr-to-Unicode conversion" and "Collations", and parts of
"References", have been moved to: WL#5021 Korean Enhancements, Deferred

References
----------

cp949 code chart:
http://msdn.microsoft.com/ru-ru/goglobal/cc305154(en-us).aspx

cp949 short description:
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V51B_HTML/SUPPDOCS/KOREADOC/KOREACH2.HTM

Short description of KS C 5601:
http://ra.dkuug.dk/CEN/TC304/guide/GGRAPH.HTM

WL#3997 New euckr characters