WL#1820: Variant SJIS and UJIS Japanese Character Sets
Affects: Server-6.1
—
Status: Assigned
We want more Japanese character sets to handle the variant encodings in different SJIS and UJIS versions. The variants differ in how they map the yen sign, reverse solidus, overline, and tilde; put another way, we are mainly trying to fix the mappings for 0x5C and 0x7E. There are several other cases where fullwidth makes a difference for mapping, so we'll try to handle them too.

The Yen+Tilde Character Sets
----------------------------

This requirement was originally added due to a complaint from a customer
whose name is visible in the Progress notes for 2010-02-05.

We will not change the current behaviour of sjis.
Confirm and document that it is equal to:
  x-sjis-jdk1.1.7              our name = sjis
  x-sjis-cp932                 our name = cp932
Add these character sets:
  x-sjis-jisx0221-1995*        our name = sjisjisx0221
  x-sjis-unicode-0.9           our name = sjisunicode

We will not change the current behaviour of ujis.
Confirm and document that it is equal to:
  x-eucjp-unicode-0.9          our name = ujis
  x-eucjp-open-19970715-ms     our name = eucjpms
Add these character sets:
  x-eucjp-open-19970715-0201   our name = eucjpopen0221
  x-eucjp-open-19970715-ascii  our name = eucjpopenascii
  x-eucjp-jisx0221-1995        our name = eucjpjisx02211995

Where I have "our name =" I [Peter Gulutzan] am just guessing.
I originally suggested names like 'sjis_jisx0221' but Bar [Alexander
Barkov] would like us to avoid use of '_' in character set names.

Oracle gets by with only a few extra "...YEN" and "...TILDE"
character sets:
  (SJIS) JA16MACSJIS, JA16SJIS, JA16SJISTILDE, JA16SJISYEN
  (UJIS) JA16EUC, JA16EUCTILDE, JA16EUCYEN.
Described in:
http://download-west.oracle.com/docs/cd/A91202_01/901_doc/server.901/a90236/appa.htm#956722

The Yen+Tilde character sets: Conversion Charts
-----------------------------------------------

According to some sources the differences are slight, and the charts
here cover the main variants. But the same sources don't appear to
mention all possible differences; for example, cp932 and sjis have
different kanji repertoires, which is not described here. So some
further complications might exist beyond what I describe here.

Here is an SJIS variation list from
http://www.y-adagio.com/public/standards/tr_xml_jpf/kaisetsu.htm
I added the IBM-943 column based on
http://www-950.ibm.com/software/globalization/icu/demo/converters?conv=ibm-943
but at this time am not proposing that we support IBM-943.

Code Point  x-sjis-jdk1.1.7               x-sjis-unicode-0.9            x-sjis-jisx0221-1995
----------  ---------------               ------------------            --------------------
0x5C        U+005C(REVERSE SOLIDUS)       U+00A5(YEN SIGN)              U+00A5(YEN SIGN)
0x7E        U+007E(TILDE)                 U+203E(OVERLINE)              U+203E(OVERLINE)
0x815C      U+2015(HORIZONTAL BAR)        U+2015(HORIZONTAL BAR)        U+2014(EM DASH)
0x815F      U+005C(REVERSE SOLIDUS)       U+005C(REVERSE SOLIDUS)       U+005C(REVERSE SOLIDUS)
0x8160      U+301C(WAVE DASH)             U+301C(WAVE DASH)             U+301C(WAVE DASH)
0x8161      U+2016(DOUBLE VERTICAL LINE)  U+2016(DOUBLE VERTICAL LINE)  U+2016(DOUBLE VERTICAL LINE)
0x817C      U+2212(MINUS SIGN)            U+2212(MINUS SIGN)            U+2212(MINUS SIGN)
0x8191      U+00A2(CENT SIGN)             U+00A2(CENT SIGN)             U+00A2(CENT SIGN)
0x8192      U+00A3(POUND SIGN)            U+00A3(POUND SIGN)            U+00A3(POUND SIGN)
0x81CA      U+00AC(NOT SIGN)              U+00AC(NOT SIGN)              U+00AC(NOT SIGN)

Code Point  x-sjis-cp932                       IBM 943
----------  ------------                       -------
0x5C        U+005C(REVERSE SOLIDUS)            U+00A5(YEN SIGN)
0x7E        U+007E(TILDE)                      U+203E(OVERLINE)
0x815C      U+2015(HORIZONTAL BAR)             U+2014(EM DASH)
0x815F      U+FF3C(FULLWIDTH REVERSE SOLIDUS)  U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0x8160      U+FF5E(FULLWIDTH TILDE)            U+301C(WAVE DASH)
0x8161      U+2225(PARALLEL TO)                U+2016(DOUBLE VERTICAL LINE)
0x817C      U+FF0D(FULLWIDTH HYPHEN-MINUS)     U+2212(MINUS SIGN)
0x8191      U+FFE0(FULLWIDTH CENT SIGN)        U+FFE0(FULLWIDTH CENT SIGN)
0x8192      U+FFE1(FULLWIDTH POUND SIGN)       U+FFE1(FULLWIDTH POUND SIGN)
0x81CA      U+FFE2(FULLWIDTH NOT SIGN)         U+FFE2(FULLWIDTH NOT SIGN)

And here supposedly is an Apple SJIS variant, unconfirmed, according to
http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

Code Point  Mac OS
----------  ------
0x5C        U+00A5(YEN SIGN)
0x7E        U+007E(TILDE)
0x80        U+005C(REVERSE SOLIDUS) (!!)
0x815C      U+2014(EM DASH)
0x815F      U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0x8160      U+301C(WAVE DASH)
0x8161      U+2016(DOUBLE VERTICAL LINE)
0x817C      U+2212(MINUS SIGN)
0x8191      U+00A2(CENT SIGN)
0x8192      U+00A3(POUND SIGN)
0x81CA      U+00AC(NOT SIGN)

Each chart shows what the value in the "Code Point" column converts to, in ucs2.
For example, _sjis_jdk 0x81ca = _ucs2 0xac, while _sjis_ibm943 0x81ca = _ucs2 0xffe2.
MySQL's "sjis" fits x-sjis-jdk1.1.7.

A partial translation of the adagio table comments:
http://sources.redhat.com/ml/libc-alpha/2000-10/msg00190.html

Here is a UJIS variation list from
http://www.y-adagio.com/public/standards/tr_xml_jpf/kaisetsu.htm

Code Point  x-eucjp-unicode-0.9           x-eucjp-jisx0221-1995
----------  -------------------           ---------------------
0x5C        U+005C(REVERSE SOLIDUS)       U+005C(REVERSE SOLIDUS)
0x7E        U+007E(TILDE)                 U+007E(TILDE)
0xA1B1      U+FFE3(FULLWIDTH MACRON)      U+FFE3(FULLWIDTH MACRON)
0xA1BD      U+2015(HORIZONTAL BAR)        U+2014(EM DASH)
0xA1C0      U+005C(REVERSE SOLIDUS)       U+005C(REVERSE SOLIDUS)
0xA1C1      U+301C(WAVE DASH)             U+301C(WAVE DASH)
0xA1C2      U+2016(DOUBLE VERTICAL LINE)  U+2016(DOUBLE VERTICAL LINE)
0xA1DD      U+2212(MINUS SIGN)            U+2212(MINUS SIGN)
0xA1F1      U+00A2(CENT SIGN)             U+00A2(CENT SIGN)
0xA1F2      U+00A3(POUND SIGN)            U+00A3(POUND SIGN)
0xA1EF      U+FFE5(FULLWIDTH YEN SIGN)    U+FFE5(FULLWIDTH YEN SIGN)
0xA2CC      U+00AC(NOT SIGN)              U+00AC(NOT SIGN)
0x8FA2B7    U+007E(TILDE)                 U+007E(TILDE)
0x8FA2C3    U+00A6(BROKEN BAR)            U+00A6(BROKEN BAR)

Code Point  x-eucjp-open-19970715-ms           x-eucjp-   x-eucjp-
                                               open-      open-
                                               19970715-  19970715-
                                               0201       ascii
----------  ------------------------           ---------  ---------
0x5C        U+005C(REVERSE SOLIDUS)            00A5       5C
0x7E        U+007E(TILDE)                      203E       7E
0xA1B1      U+FFE3(FULLWIDTH MACRON)           FFE3       203E
0xA1BD      U+2015(HORIZONTAL BAR)             2014       2014
0xA1C0      U+FF3C(FULLWIDTH REVERSE SOLIDUS)  005C       FF3C
0xA1C1      U+FF5E(FULLWIDTH TILDE)            301C       301C
0xA1C2      U+2225(PARALLEL TO)                2016       2016
0xA1DD      U+FF0D(FULLWIDTH HYPHEN-MINUS)     2212       2212
0xA1F1      U+FFE0(FULLWIDTH CENT SIGN)        00A2       00A2
0xA1F2      U+FFE1(FULLWIDTH POUND SIGN)       00A3       00A3
0xA1EF      U+FFE5(FULLWIDTH YEN SIGN)         FFE5       00A5
0xA2CC      U+FFE2(FULLWIDTH NOT SIGN)         00AC       00AC
0x8FA2B7    U+FF5E(FULLWIDTH TILDE)            007E       FF5E
0x8FA2C3    U+FFE4(FULLWIDTH BROKEN BAR)       00A6       00A6

MySQL's "ujis" fits x-eucjp-unicode-0.9.
MySQL's "eucjpms" fits x-eucjp-open-19970715-ms.
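
The charts can be read as per-variant lookup tables. As a minimal sketch (not MySQL code), the Python below transcribes three rows of the SJIS charts and reports where two variants disagree. The names SJIS_TO_UCS2, to_ucs2 and differing_code_points are hypothetical, and only a few code points are included; the UJIS charts could be handled the same way.

```python
# Sketch only: the table and function names are hypothetical, and the
# entries are limited to a few rows transcribed from the SJIS charts.
SJIS_TO_UCS2 = {
    "x-sjis-jdk1.1.7":      {0x5C: 0x005C, 0x7E: 0x007E, 0x815C: 0x2015},
    "x-sjis-unicode-0.9":   {0x5C: 0x00A5, 0x7E: 0x203E, 0x815C: 0x2015},
    "x-sjis-jisx0221-1995": {0x5C: 0x00A5, 0x7E: 0x203E, 0x815C: 0x2014},
    "x-sjis-cp932":         {0x5C: 0x005C, 0x7E: 0x007E, 0x815C: 0x2015},
}


def to_ucs2(variant, code_point):
    """Return the ucs2 value listed for code_point in this variant,
    or None if this sketch has no entry for it."""
    return SJIS_TO_UCS2[variant].get(code_point)


def differing_code_points(variant_a, variant_b):
    """List the code points (among those in the sketch) that the two
    variants convert to different ucs2 values."""
    table_a = SJIS_TO_UCS2[variant_a]
    table_b = SJIS_TO_UCS2[variant_b]
    shared = table_a.keys() & table_b.keys()
    return sorted(cp for cp in shared if table_a[cp] != table_b[cp])


if __name__ == "__main__":
    for cp in differing_code_points("x-sjis-jdk1.1.7", "x-sjis-unicode-0.9"):
        print(f"0x{cp:X}: "
              f"U+{to_ucs2('x-sjis-jdk1.1.7', cp):04X} vs "
              f"U+{to_ucs2('x-sjis-unicode-0.9', cp):04X}")
```

For the two variants compared in the usage lines, the only differing entries in this reduced table are 0x5C and 0x7E, which is the yen/tilde problem this task is about.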

The mm character sets
---------------------

This requirement was originally added due to a suggestion from a
Japanese user group represented by Shuichi Tamagawa.

Four more Japanese character sets:
  sjismm      same as sjis
  eucjpmsmm   same as eucjpms
  ujismm      same as ujis
  cp932mm     same as cp932

There will be no 'mm' equivalents for the new character sets described
in the previous section (sjisjisx0221 etc.). The Japanese users'
suggestion was only for current character sets.

The initials 'mm' stand for 'multiple mapping'.

Saying sjismm is "same as sjis" means that the repertoire is the same,
the representation (code points) is the same, and the applicable
ordering rules are the same. The only difference is the way that
conversions occur: when casting or converting to/from another character
set such as ucs2, the results may differ for 22 characters.

The mm character sets: Conversion Chart
---------------------------------------

In this conversion table, the ucs2 column is the source, and the
sjis/cp932/ujis/eucjpms/etc. columns are the destination, that is, what
the hexadecimal result would be if we used convert(ucs2) or if we
assigned a ucs2 column containing the value to an
sjis/cp932/ujis/eucjpms column.

character name             ucs2  sjis  sjismm  cp932  cp932mm  ujis    ujismm  eucjpms  eucjpmsmm
--------------             ----  ----  ------  -----  -------  ----    ------  -------  ---------
BROKEN BAR                 00A6  3F    3F      3F     FA55     8FA2C3  8FA2C3  3F       8FA2C3
FULLWIDTH BROKEN BAR       FFE4  3F    3F      FA55   FA55     3F      8FA2C3  8FA2C3   8FA2C3
YEN SIGN                   00A5  3F    5C      3F     5C       20      5C      3F       5C
FULLWIDTH YEN SIGN         FFE5  818F  ?       818F   ?        A1EF    A1EF    3F       A1EF
TILDE                      007E  7E    7E      7E     7E       7E      7E      7E       7E
OVERLINE                   203E  3F    7E      3F     7E       20      7E      3F       7E
HORIZONTAL BAR             2015  815C  815C    815C   815C     A1BD    A1BD    A1BD     A1BD
EM DASH                    2014  3F    815C    3F     815C     3F      A1BD    3F       A1BD
REVERSE SOLIDUS            005C  815F  5C      5C     5C       5C      5C      5C       5C
FULLWIDTH REVERSE SOLIDUS  FF3C  3F    815F    815F   815F     3F      A1C0    A1C0     A1C0
WAVE DASH                  301C  8160  8160    3F     8160     A1C1    A1C1    3F       A1C1
FULLWIDTH TILDE            FF5E  3F    8160    8160   8160     3F      A1C1    A1C1     A1C1
DOUBLE VERTICAL LINE       2016  8161  8161    3F     8161     A1C2    A1C2    3F       A1C2
PARALLEL TO                2225  3F    8161    8161   8161     3F      A1C2    A1C2     A1C2
MINUS SIGN                 2212  817C  817C    3F     817C     A1DD    A1DD    3F       A1DD
FULLWIDTH HYPHEN-MINUS     FF0D  3F    817C    817C   817C     3F      A1DD    A1DD     A1DD
CENT SIGN                  00A2  8191  8191    3F     8191     A1F1    A1F1    3F       A1F1
FULLWIDTH CENT SIGN        FFE0  3F    8191    8191   8191     3F      A1F1    A1F1     A1F1
POUND SIGN                 00A3  8192  8192    3F     8192     A1F2    A1F2    3F       A1F2
FULLWIDTH POUND SIGN       FFE1  3F    8192    8192   8192     3F      A1F2    A1F2     A1F2
NOT SIGN                   00AC  81CA  81CA    3F     81CA     A2CC    A2CC    3F       A2CC
FULLWIDTH NOT SIGN         FFE2  3F    81CA    81CA   81CA     3F      A2CC    A2CC     A2CC

(No special rules for U+FFE3 FULLWIDTH MACRON or U+2026 HORIZONTAL
ELLIPSIS.)

For example, consider this extract from the above table:

                     ucs2  sjis  sjis
                           curr  mm
                     ----  ----  ----
NOT SIGN             00AC  81CA  81CA
FULLWIDTH NOT SIGN   FFE2  3F    81CA

It means "for NOT SIGN which is Unicode U+00AC, MySQL converts to sjis
code point 0x81CA and that's the same thing that Shuichi Tamagawa wants
(no change), but for FULLWIDTH NOT SIGN which is Unicode U+FFE2, MySQL
converts to sjis code point 0x3F which is question mark "?" and that is
different from Mr Tamagawa's patch, which would convert to 0x81CA."

Looking for a pattern in the table, you'll see that the frequent desire
is to replace 0x3F with the closest fullwidth coding (if sjis/ujis), or
the closest not-fullwidth coding (if cp932/eucjpms). That pattern is
loose and is not a universal rule, therefore we keep the old way (in
sjis etc.) but allow a new way (in sjismm etc.).

The Japanese group did not propose changes for the reverse direction, i.e.
with destination = ucs2, except for "sjis" 0x815F and "ujis" 0xA1C0.
I [Peter Gulutzan] will propose here a rule which is compatible with
the "sjis 0x815F / ujis 0xA1C0" request, for all cases:
When two possibilities exist, take the bigger.

For example, using NOT SIGN again, we see that _sjismm 0x81ca can be
converted to either _ucs2 0x00ac or _ucs2 0xffe2. Choose _ucs2 0xffe2
because it is bigger. And similarly, since _sjismm 0x81ca can be
converted to either _ujis 0xa2cc or _ujis 0x3f, choose _ujis 0xa2cc
because it is bigger. (The exception, of course, is that 0x3f always
converts to 0x3f and 0x20 always converts to 0x20.)

This does not help round tripping. But that's okay, round tripping
won't work anyway.

The chart also shows two slight errors in the current character sets:
  U+00A5 converts to _ujis 20 instead of 3F
  U+203E converts to _ujis 20 instead of 3F
But we will not fix these errors.

Minor note: if MySQL had an mm for the eucjp-open "ascii" variant,
then it would convert U+203E to A1B1.

Collations
----------

Collations for the new character sets will be the same as those for the
current character sets. It might prevent confusion if we had different
names, for example sjismm_japanese_ci not sjis_japanese_ci. But in fact
the differing characters are in the same positions.

Does that mean that we can say ... WHERE _sjis 'a' = _sjismm 'a'?
Answer: NO. That would be easy to allow, but special rules confuse.

Advice
------

The variant character sets will be little used, so save space on lookup
tables, and say that converting with these character sets will be a
little slower.

JIS 2004
--------

Japanese Industrial Standard X 0213 came out in 2000 (JIS X 0213:2000)
and was revised in 2004 (JIS X 0213:2004). A Wikipedia introduction is:
http://en.wikipedia.org/wiki/JIS_X_0213

Both JIS X 0213:2000 and JIS X 0213:2004 added characters outside the
Unicode BMP, but MySQL added support for non-BMP characters due to
WL#1213 "Implement 4-byte UTF8, UTF16 and UTF32".
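
To illustrate why non-BMP support matters for JIS X 0213, here is a minimal Python sketch showing that such a character needs four bytes in UTF-8 and a surrogate pair in UTF-16, so a BMP-only character set such as ucs2 cannot hold it. U+20B9F is used only as a sample non-BMP code point; that it belongs to the JIS X 0213 repertoire is an assumption to check against the x0213.org mapping tables. The function names are hypothetical.

```python
# Sketch only. SAMPLE is a non-BMP code point; its membership in
# JIS X 0213 is an assumption, not something this code verifies.
SAMPLE = "\U00020B9F"


def is_bmp(char):
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(char) <= 0xFFFF


def encoded_lengths(char):
    """Byte lengths of one character in the three Unicode encodings
    named by WL#1213."""
    return {
        "utf8": len(char.encode("utf-8")),
        "utf16": len(char.encode("utf-16-be")),
        "utf32": len(char.encode("utf-32-be")),
    }


if __name__ == "__main__":
    print(is_bmp(SAMPLE))                     # a non-BMP character
    print(encoded_lengths(SAMPLE))            # 4 bytes in each encoding
    print(SAMPLE.encode("utf-16-be").hex())   # surrogate pair d842 df9f
```

By contrast, a BMP character such as U+00A5 YEN SIGN takes two bytes in both UTF-8 and UTF-16.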

JIS X 0213:2004, often called "JIS 2004" or "JIS2004", is important
because Vista supports it.

"Mapping Tables between JIS X 0213 and Unicode" for SJIS + UJIS are here:
http://x0213.org/codetable/index.en.html
It includes:
* Mapping table between Shift_JIS-2004 and Unicode
* Mapping table between EUC-JIS-2004 and Unicode
* Mapping table between ISO-2022-JP-2004 and Unicode
* Mapping table between JIS X 0213:2004 7-bit code and Unicode
For each new character, these documents state which version of
JIS X 0213 and Unicode the character first appeared in.
So we have a clear specification.

The unsettled point is: should the new characters be added to the
existing SJIS and UJIS character sets, or should there be even more
character sets? Alexander Barkov and Peter Gulutzan discussed this in
May/June 2007 "Extending ujis with JIS X 0213 characters" (dev-public
thread), [ mysql intranet archive]
/secure/mailarchive/mail.php?folder=5&mail=67744

References
----------

http://www.haible.de/bruno/charsets/conversion-tables/Japanese.html
http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html
http://www.opengroup.or.jp/jvc/cde/appendix-e.html
http://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/
http://sources.redhat.com/ml/libc-alpha/2000-10/msg00190.html
http://x0213.org/codetable/index.en.html
http://www.debian.org/doc/manuals/intro-i18n/intro-i18n.txt

BUG#50934 hyphen is not mapped

A long series of emails about the mm suggestion is in thread
"RE: Feedback and Requests from Japanese users community"
(not in a public archive but available from Peter Gulutzan
or Alexander Barkov).

See also WL#2555 Standard Japanese collation support.