WL#1820: Variant SJIS and UJIS Japanese Character Sets

Affects: Server-6.1 — Status: Assigned

Description
High Level Architecture

We want more Japanese character sets, to handle the variant
encodings in different SJIS and UJIS versions. The main
differences affect yen sign, reverse solidus, overline, and
tilde; put another way, we are mostly trying to fix the
mappings of 0x5C and 0x7E. There are several other cases where
fullwidth forms make a difference for mapping, so we'll try to
handle them too.

The Yen+Tilde Character Sets
----------------------------

This requirement was originally added due to a
complaint from a customer whose name is visible
in the Progress notes for 2010-02-05.

We will not change the current behaviour of sjis.
Confirm and document that it is equal to:
x-sjis-jdk1.1.7                   our name = sjis
x-sjis-cp932                      our name = cp932
Add these character sets
x-sjis-jisx0221-1995*             our name = sjisjisx0221
x-sjis-unicode-0.9                our name = sjisunicode

We will not change the current behaviour of ujis.
Confirm and document that it is equal to:
x-eucjp-unicode-0.9               our name = ujis
x-eucjp-open-19970715-ms          our name = eucjpms
Add these character sets
x-eucjp-open-19970715-0201        our name = eucjpopen0201
x-eucjp-open-19970715-ascii       our name = eucjpopenascii
x-eucjp-jisx0221-1995             our name = eucjpjisx02211995

Where I have "our name =" I [Peter Gulutzan] am just guessing.
I originally suggested names like 'sjis_jisx0221' but Bar
[Alexander Barkov] would like us to avoid use of '_' in character
set names.

Oracle gets by with only a few extra "...YEN" and "...TILDE"
character sets:
(SJIS) JA16MACSJIS, JA16SJIS, JA16SJISTILDE, JA16SJISYEN
(UJIS) JA16EUC, JA16EUCTILDE, JA16EUCYEN.
Described in:
http://download-west.oracle.com/docs/cd/A91202_01/901_doc/server.901/a90236/appa.htm#956722

The Yen+Tilde character sets: Conversion Charts
-----------------------------------------------

The differences are apparently slight, according to some sources,
and these are the main variants. But the same sources don't appear
to mention all possible differences, for example cp932 and sjis
have a different kanji repertoire which isn't described here. So
some further complications might exist beyond what I describe here.

Here is an SJIS variation list from
http://www.y-adagio.com/public/standards/tr_xml_jpf/kaisetsu.htm
I added the IBM-943 column based on
http://www-950.ibm.com/software/globalization/icu/demo/converters?conv=ibm-943
but at this time am not proposing that we support IBM-943.

Code Point x-sjis-jdk1.1.7         x-sjis-unicode-0.9      x-sjis-jisx0221-1995
---------- ---------------         ------------------      --------------------
0x5C       U+005C(REVERSE SOLIDUS) U+00A5(YEN SIGN)        U+00A5(YEN SIGN)
0x7E       U+007E(TILDE)           U+203E(OVERLINE)        U+203E(OVERLINE)
0x815C     U+2015(HORIZONTAL BAR)  U+2015(HORIZONTAL BAR)  U+2014(EM DASH)
0x815F     U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS) U+005C(REVERSE SOLIDUS)
0x8160     U+301C(WAVE DASH)       U+301C(WAVE DASH)       U+301C(WAVE DASH)
0x8161     U+2016(DOUBLEVERTICAL)  U+2016(DOUBLEVERTICAL)  U+2016(DOUBLEVERTICAL)
0x817C     U+2212(MINUS SIGN)      U+2212(MINUS SIGN)      U+2212(MINUS SIGN)
0x8191     U+00A2(CENT SIGN)       U+00A2(CENT SIGN)       U+00A2(CENT SIGN)
0x8192     U+00A3(POUND SIGN)      U+00A3(POUND SIGN)      U+00A3(POUND SIGN)
0x81CA     U+00AC(NOT SIGN)        U+00AC(NOT SIGN)        U+00AC(NOT SIGN)

Code Point x-sjis-cp932                       IBM 943
---------- ------------                       -------
0x5c       U+005C(REVERSE SOLIDUS)            U+00A5(YEN SIGN)
0x7e       U+007E(TILDE)                      U+203E(OVERLINE)
0x815C     U+2015(HORIZONTAL BAR)             U+2014(EM DASH)
0x815F     U+FF3C(FULLWIDTH REVERSE SOLIDUS)  U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0x8160     U+FF5E(FULLWIDTH TILDE)            U+301C(WAVE DASH)
0x8161     U+2225(PARALLEL TO)                U+2016(DOUBLEVERTICAL)
0x817C     U+FF0D(FULLWIDTH HYPHEN-MINUS)     U+2212(MINUS SIGN)
0x8191     U+FFE0(FULLWIDTH CENT SIGN)        U+FFE0(FULLWIDTH CENT SIGN)
0x8192     U+FFE1(FULLWIDTH POUND SIGN)       U+FFE1(FULLWIDTH POUND SIGN)
0x81CA     U+FFE2(FULLWIDTH NOT SIGN)         U+FFE2(FULLWIDTH NOT SIGN)

And here supposedly is an Apple SJIS variant, unconfirmed, according to
http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT
Code Point Mac OS
---------- ------
0x5c       U+00A5(YEN SIGN)
0x7e       U+007E(TILDE)
0x80       U+005C(REVERSE SOLIDUS) (!!)
0x815c     U+2014(EM DASH)
0x815F     U+FF3C(FULLWIDTH REVERSE SOLIDUS)
0x8160     U+301C(WAVE DASH)
0x8161     U+2016(DOUBLE VERTICAL)
0x817C     U+2212(MINUS SIGN)
0x8191     U+00A2(CENT SIGN)
0x8192     U+00A3(POUND SIGN)
0x81CA     U+00AC(NOT SIGN)


Each chart shows what the value in the "Code Point" column converts to,
in ucs2. For example, 0x81CA converts to _ucs2 0x00AC in x-sjis-jdk1.1.7
but to _ucs2 0xFFE2 in IBM-943.
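
As a minimal sketch of how to read the SJIS charts above, the Python
below stores three rows of the charts as dictionaries and looks up the
ucs2 value for a code point in a given variant. The names SJIS_TO_UCS2
and to_ucs2 are made up for this illustration; they are not MySQL
identifiers, and only the rows shown are covered.

```python
# Three rows copied from the SJIS charts above: code point -> ucs2 value.
# Hypothetical illustration only; not a MySQL data structure.
SJIS_TO_UCS2 = {
    "x-sjis-jdk1.1.7":      {0x5C: 0x005C, 0x7E: 0x007E, 0x81CA: 0x00AC},
    "x-sjis-unicode-0.9":   {0x5C: 0x00A5, 0x7E: 0x203E, 0x81CA: 0x00AC},
    "x-sjis-jisx0221-1995": {0x5C: 0x00A5, 0x7E: 0x203E, 0x81CA: 0x00AC},
    "x-sjis-cp932":         {0x5C: 0x005C, 0x7E: 0x007E, 0x81CA: 0xFFE2},
    "IBM-943":              {0x5C: 0x00A5, 0x7E: 0x203E, 0x81CA: 0xFFE2},
}


def to_ucs2(variant, code_point):
    """Return the ucs2 value that the chart gives for code_point in variant."""
    return SJIS_TO_UCS2[variant][code_point]


# Print the ucs2 value of 0x81CA for every variant listed above.
for name in SJIS_TO_UCS2:
    print(f"{name:22} 0x81CA -> U+{to_ucs2(name, 0x81CA):04X}")
```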

MySQL's "sjis" fits x-sjis-jdk1.1.7.

A partial translation of the adagio table comments:
http://sources.redhat.com/ml/libc-alpha/2000-10/msg00190.html

Here is a UJIS variation list from
http://www.y-adagio.com/public/standards/tr_xml_jpf/kaisetsu.htm

Code Point x-eucjp-unicode-0.9        x-eucjp-jisx0221-1995      
---------- -------------------        ---------------------
0x5C       U+005C(REVERSE SOLIDUS)    U+005C(REVERSE SOLIDUS)    
0x7E       U+007E(TILDE)              U+007E(TILDE)              
0xA1B1     U+FFE3(FULLWIDTH MACRON)   U+FFE3(FULLWIDTH MACRON)   
0xA1BD     U+2015(HORIZONTAL BAR)     U+2014(EM DASH)            
0xA1C0     U+005C(REVERSE SOLIDUS)    U+005C(REVERSE SOLIDUS)    
0xA1C1     U+301C(WAVE DASH)          U+301C(WAVE DASH)          
0xA1C2     U+2016(DOUBLEVERTICAL)     U+2016(DOUBLEVERTICAL)
0xA1DD     U+2212(MINUS SIGN)         U+2212(MINUS SIGN)         
0xA1F1     U+00A2(CENT SIGN)          U+00A2(CENT SIGN)          
0xA1F2     U+00A3(POUND SIGN)         U+00A3(POUND SIGN)         
0xA1EF     U+FFE5(FULLWIDTH YEN SIGN) U+FFE5(FULLWIDTH YEN SIGN) 
0xA2CC     U+00AC(NOT SIGN)           U+00AC(NOT SIGN)           
0x8FA2B7   U+007E(TILDE)              U+007E(TILDE)              
0x8FA2C3   U+00A6(BROKEN BAR)         U+00A6(BROKEN BAR)         

Code Point x-eucjp-open-19970715-ms          x-eucjp-   x-eucjp-
                                             open-      open-
                                             19970715-  19970715-
                                             0201       ascii
---------- ------------------------          ---------- ---------
0x5C       U+005C(REVERSE SOLIDUS)           00A5       5C
0x7E       U+007E(TILDE)                     203E       7E
0xA1B1     U+FFE3(FULLWIDTH MACRON)          FFE3       203E
0xA1BD     U+2015(HORIZONTAL BAR)            2014       2014
0xA1C0     U+FF3C(FULLWIDTH REVERSE SOLIDUS) 005C       FF3C
0xA1C1     U+FF5E(FULLWIDTH TILDE)           301C       301C
0xA1C2     U+2225(PARALLEL TO)               2016       2016
0xA1DD     U+FF0D(FULLWIDTH HYPHEN-MINUS)    2212       2212
0xA1F1     U+FFE0(FULLWIDTH CENT SIGN)       00A2       00A2
0xA1F2     U+FFE1(FULLWIDTH POUND SIGN)      00A3       00A3
0xA1EF     U+FFE5(FULLWIDTH YEN SIGN)        FFE5       00A5
0xA2CC     U+FFE2(FULLWIDTH NOT SIGN)        00AC       00AC
0x8FA2B7   U+FF5E(FULLWIDTH TILDE)           007E       FF5E
0x8FA2C3   U+FFE4(FULLWIDTH BROKEN BAR)      00A6       00A6

MySQL's "ujis" fits x-eucjp-unicode-0.9.
MySQL's "eucjpms" fits x-eucjp-open-19970715-ms.
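
To make the differences between the UJIS variants easier to see, here
is a small sketch that compares two charts and lists the code points
where they disagree. The rows are copied from the UJIS charts above
(only some rows are included); the names UNICODE_09, JISX0221, OPEN_MS
and differing_code_points are hypothetical.

```python
# Rows copied from the UJIS charts above (code point -> ucs2 value).
# Only a subset of rows is listed; all names here are hypothetical.
UNICODE_09 = {0x5C: 0x005C, 0x7E: 0x007E, 0xA1BD: 0x2015, 0xA1C0: 0x005C,
              0xA1C1: 0x301C, 0xA1C2: 0x2016, 0xA1DD: 0x2212}
JISX0221 = {0x5C: 0x005C, 0x7E: 0x007E, 0xA1BD: 0x2014, 0xA1C0: 0x005C,
            0xA1C1: 0x301C, 0xA1C2: 0x2016, 0xA1DD: 0x2212}
OPEN_MS = {0x5C: 0x005C, 0x7E: 0x007E, 0xA1BD: 0x2015, 0xA1C0: 0xFF3C,
           0xA1C1: 0xFF5E, 0xA1C2: 0x2225, 0xA1DD: 0xFF0D}


def differing_code_points(a, b):
    """Return the sorted code points that a and b map to different values."""
    return sorted(cp for cp in a if a[cp] != b[cp])


# ujis (x-eucjp-unicode-0.9) against x-eucjp-jisx0221-1995
print([hex(cp) for cp in differing_code_points(UNICODE_09, JISX0221)])
# ujis (x-eucjp-unicode-0.9) against eucjpms (x-eucjp-open-19970715-ms)
print([hex(cp) for cp in differing_code_points(UNICODE_09, OPEN_MS)])
```

For the rows listed, the first comparison reports only 0xA1BD, and the
second reports 0xA1C0, 0xA1C1, 0xA1C2 and 0xA1DD.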

The mm character sets
---------------------

This requirement was originally added due to a suggestion
from a Japanese user group represented by Shuichi Tamagawa.

Four more Japanese character sets:
sjismm          same as sjis
eucjpmsmm       same as eucjpms
ujismm          same as ujis
cp932mm         same as cp932

There will be no 'mm' equivalents for the new character sets
described in the previous section (sjisjisx0221 etc.). The
Japanese users' suggestion was only for current character sets. 

The initials 'mm' stand for 'multiple mapping'.

Saying sjismm is "same as sjis" means that the repertoire is
the same, the representation (code points) is the same, and
the applicable ordering rules are the same. The only
difference is in conversion: when casting or converting
to/from another character set such as ucs2, the results
may differ for 22 characters.

The mm character sets: Conversion Chart
---------------------------------------

In this conversion table, the ucs2 column is the
source, and the sjis/cp932/ujis/eucjpms/etc. columns
are the destination, that is, what the hexadecimal
result would be if we used convert(ucs2) or if we
assigned a ucs2 column containing the value to an
sjis/cp932/ujis/eucjpms column.

character name         ucs2 sjis sjis cp93 cp93 ujis   ujis   eucj   eucj
                                 mm   2    2mm         mm     pms    pmsmm
----                   ---- ---- ---- ---- ---- ----   ----   ----   ----

BROKEN BAR             00A6   3F   3F   3F FA55 8FA2C3 8FA2C3   3F   8FA2C3
FULLWIDTH BROKEN BAR   FFE4   3F   3F FA55 FA55   3F   8FA2C3 8FA2C3 8FA2C3

YEN SIGN               00A5   3F   5C   3F   5C   20     5C    3F      5C
FULLWIDTH YEN SIGN     FFE5 818F    ? 818F    ? A1EF   A1EF    3F    A1EF

TILDE                  007E   7E   7E   7E   7E   7E     7E     7E     7E
OVERLINE               203E   3F   7E   3F   7E   20     7E     3F     7E

HORIZONTAL BAR         2015 815C 815C 815C 815C A1BD   A1BD   A1BD   A1BD
EM DASH                2014   3F 815C   3F 815C   3F   A1BD     3F   A1BD

REVERSE SOLIDUS        005C 815F   5C   5C   5C   5C     5C     5C     5C
FULLWIDTH ""           FF3C   3F 815F 815F 815F   3F   A1C0   A1C0   A1C0

WAVE DASH              301C 8160 8160   3F 8160 A1C1   A1C1     3F   A1C1
FULLWIDTH TILDE        FF5E   3F 8160 8160 8160   3F   A1C1   A1C1   A1C1

DOUBLE VERTICAL LINE   2016 8161 8161   3F 8161 A1C2   A1C2     3F   A1C2
PARALLEL TO            2225   3F 8161 8161 8161   3F   A1C2   A1C2   A1C2

MINUS SIGN             2212 817C 817C   3F 817C A1DD   A1DD     3F   A1DD
FULLWIDTH HYPHEN-MINUS FF0D   3F 817C 817C 817C   3F   A1DD   A1DD   A1DD

CENT SIGN              00A2 8191 8191   3F 8191 A1F1   A1F1     3F   A1F1
FULLWIDTH CENT SIGN    FFE0   3F 8191 8191 8191   3F   A1F1   A1F1   A1F1

POUND SIGN             00A3 8192 8192   3F 8192 A1F2   A1F2     3F   A1F2
FULLWIDTH POUND SIGN   FFE1   3F 8192 8192 8192   3F   A1F2   A1F2   A1F2

NOT SIGN               00AC 81CA 81CA   3F 81CA A2CC   A2CC     3F   A2CC
FULLWIDTH NOT SIGN     FFE2   3F 81CA 81CA 81CA   3F   A2CC   A2CC   A2CC

(No special rules for U+FFE3 FULLWIDTH MACRON or U+2026 HORIZONTAL ELLIPSIS.)

For example, consider this extract from the above table:
                       ucs2 sjis sjis
                            curr mm
                       ---- ---- ----
NOT SIGN               00AC 81CA 81CA
FULLWIDTH NOT SIGN     FFE2   3F 81CA
It means "for NOT SIGN which is Unicode U+00AC,
MySQL converts to sjis code point 0x81CA and
that's the same thing that Shuichi Tamagawa wants
(no change), but for FULLWIDTH NOT SIGN which
is Unicode U+FFE2, MySQL converts to sjis code
point 0x3F which is question mark "?" and that
is different from Mr Tamagawa's patch, which would
convert to 0x81CA."
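
The same extract can be sketched in Python. This is only an
illustration under the assumption that an unmapped character becomes
0x3F, as the chart shows for sjis; the table and function names are
hypothetical, and only three pairs of rows from the chart are included.

```python
UNMAPPED = 0x3F  # question mark, the result when no mapping exists

# Rows copied from the chart above: ucs2 value -> sjis code point.
SJIS_FROM_UCS2 = {
    0x00AC: 0x81CA,  # NOT SIGN
    0x00A2: 0x8191,  # CENT SIGN
    0x00A3: 0x8192,  # POUND SIGN
}

# sjismm keeps those rows and also accepts the fullwidth forms.
SJISMM_FROM_UCS2 = dict(SJIS_FROM_UCS2)
SJISMM_FROM_UCS2.update({
    0xFFE2: 0x81CA,  # FULLWIDTH NOT SIGN
    0xFFE0: 0x8191,  # FULLWIDTH CENT SIGN
    0xFFE1: 0x8192,  # FULLWIDTH POUND SIGN
})


def convert(table, ucs2_value):
    """Convert one ucs2 value, falling back to 0x3F when it is unmapped."""
    return table.get(ucs2_value, UNMAPPED)


for ucs2_value in (0x00AC, 0xFFE2):
    print(f"U+{ucs2_value:04X}: sjis {convert(SJIS_FROM_UCS2, ucs2_value):X}, "
          f"sjismm {convert(SJISMM_FROM_UCS2, ucs2_value):X}")
```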

Looking for a pattern in the table, you'll see
that the frequent desire is to replace 0x3F
with the closest fullwidth coding (if
sjis/ujis), or the closest not-fullwidth coding
(if cp932/eucjpms). That is only a loose pattern,
not a universal rule, therefore we keep the
old way (in sjis etc.) but allow a new way
(in sjismm etc.).

The Japanese group did not propose changes for
the reverse direction, i.e. with destination
= ucs2, except for "sjis" 0x815F and "ujis" 0xA1C0.
I [Peter Gulutzan] will propose here a rule which
is compatible with the "sjis 0x815F / ujis 0xA1C0" request,
for all cases:

When two possibilities exist, take the bigger.

For example, using NOT SIGN again, we see that
_sjismm 0x81ca can be converted to either
_ucs2 0x00ac or _ucs2 0xffe2. Choose _ucs2 0xffe2
because it is bigger. And similarly, since
_sjismm 0x81ca can be converted to either
_ujis 0xa2cc or _ujis 0x3f, choose _ujis 0xa2cc
because it is bigger. (The exception, of course,
is that 0x3f always converts to 0x3f and 0x20
always converts to 0x20.)
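
A minimal sketch of the "take the bigger" rule for the direction
sjismm -> ucs2, assuming the reverse table is derived by inverting the
forward (ucs2 -> sjismm) rows of the chart. The function name
reverse_take_bigger is hypothetical, and the 0x3F / 0x20 exception is
left out because the forward rows used here contain no unmapped entries.

```python
from collections import defaultdict

# Forward rows (ucs2 -> sjismm) copied from the chart above.
SJISMM_FROM_UCS2 = {
    0x00AC: 0x81CA, 0xFFE2: 0x81CA,  # NOT SIGN, FULLWIDTH NOT SIGN
    0x2212: 0x817C, 0xFF0D: 0x817C,  # MINUS SIGN, FULLWIDTH HYPHEN-MINUS
    0x2015: 0x815C, 0x2014: 0x815C,  # HORIZONTAL BAR, EM DASH
}


def reverse_take_bigger(forward):
    """Invert a ucs2 -> code point table; on a clash keep the bigger ucs2."""
    candidates = defaultdict(list)
    for ucs2_value, code_point in forward.items():
        candidates[code_point].append(ucs2_value)
    return {cp: max(values) for cp, values in candidates.items()}


SJISMM_TO_UCS2 = reverse_take_bigger(SJISMM_FROM_UCS2)
for code_point, ucs2_value in sorted(SJISMM_TO_UCS2.items()):
    print(f"_sjismm 0x{code_point:X} -> _ucs2 0x{ucs2_value:04X}")
```

With these rows, 0x81CA goes to 0xFFE2 as in the NOT SIGN example, and
0x815C goes to 0x2015 because HORIZONTAL BAR is bigger than EM DASH.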

This does not help round tripping. But
that's okay, round tripping won't work
anyway.

The chart also shows two slight errors in the
current character sets:
U+00A5 converts to _ujis 20 instead of 3F
U+203E converts to _ujis 20 instead of 3F
But we will not fix these errors.

Minor note: if MySQL had an mm for the eucjp-open "ascii"
variant, then it would convert U+203E to A1B1.

Collations
----------

Collations for the new character sets will
be the same as the current character sets.
It might prevent confusion if we had different
names, for example sjismm_japanese_ci not
sjis_japanese_ci. But in fact the differing
characters are in the same positions.

Does that mean that we can say
... WHERE _sjis 'a' = _sjismm 'a'?
Answer: NO. Allowing it would be easy, but special rules cause confusion.

Advice
------

The variant character sets will be little used, so
save space on lookup tables, and document that converting
with these character sets will be a little slower.
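
One possible reading of this advice, sketched below, is that a variant
keeps only the rows where it differs from its base character set and
falls back to the base table for everything else. This is an
assumption about the design, not a description of the server code. The
rows come from the SJIS chart (x-sjis-jdk1.1.7 as the base, which is
MySQL's sjis, and x-sjis-jisx0221-1995 as the variant); the names are
hypothetical.

```python
# Some rows of the base table (x-sjis-jdk1.1.7, which MySQL's sjis fits).
SJIS_TO_UCS2 = {0x5C: 0x005C, 0x7E: 0x007E, 0x815C: 0x2015,
                0x815F: 0x005C, 0x8160: 0x301C, 0x81CA: 0x00AC}

# The variant stores only the rows that differ from the base.
SJISJISX0221_OVERRIDES = {0x5C: 0x00A5, 0x7E: 0x203E, 0x815C: 0x2014}


def variant_to_ucs2(code_point, overrides, base):
    """Check the small override table first, then fall back to the base."""
    if code_point in overrides:
        return overrides[code_point]
    return base[code_point]


for cp in (0x5C, 0x815C, 0x81CA):
    result = variant_to_ucs2(cp, SJISJISX0221_OVERRIDES, SJIS_TO_UCS2)
    print(f"0x{cp:X} -> U+{result:04X}")
```

The extra membership test on every conversion is where the "a little
slower" cost would come from in this sketch; the saving is that the
variant needs three rows instead of a full copy of the table.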

JIS 2004
--------

Japanese Industrial Standard X 0213 came out in 2000 (JIS X 0213:2000)
and was revised in 2004 (JIS X 0213:2004). A Wikipedia introduction is:
http://en.wikipedia.org/wiki/JIS_X_0213
Both JIS X 0213:2000 and JIS X 0213:2004 added characters outside the
Unicode BMP, but that is no obstacle now that MySQL supports non-BMP
characters due to WL#1213 "Implement 4-byte UTF8, UTF16 and UTF32".
JIS X 0213:2004, often called "JIS 2004" or "JIS2004", is important
because Windows Vista supports it.

"Mapping Tables between JIS X 0213 and Unicode" for SJIS + UJIS are here:
http://x0213.org/codetable/index.en.html
It includes:
* Mapping table between Shift_JIS-2004 and Unicode
* Mapping table between EUC-JIS-2004 and Unicode
* Mapping table between ISO-2022-JP-2004 and Unicode
* Mapping table between JIS X 0213:2004 7-bit code and Unicode
For each new character, these documents state which version of 
JIS X 0213 and Unicode the character first appeared in.
So we have a clear specification.

The unsettled point is: should the new characters be added to the
existing SJIS and UJIS character sets, or should there be even
more character sets? Alexander Barkov and Peter Gulutzan discussed
this in May/June 2007
"Extending ujis with JIS X 0213 characters" (dev-public thread),
[ mysql intranet archive] /secure/mailarchive/mail.php?folder=5&mail=67744



References
----------

http://www.haible.de/bruno/charsets/conversion-tables/Japanese.html
http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html
http://www.opengroup.or.jp/jvc/cde/appendix-e.html
http://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/
http://sources.redhat.com/ml/libc-alpha/2000-10/msg00190.html
http://x0213.org/codetable/index.en.html

http://www.debian.org/doc/manuals/intro-i18n/intro-i18n.txt

BUG#50934 hyphen is not mapped

A long series of emails about the mm suggestion is in thread
"RE: Feedback and Requests from Japanese users community"
(not in a public archive but available from Peter Gulutzan
or Alexander Barkov)

See also WL#2555 Standard Japanese collation support.