WL#3090: Japanese Character Set Adjustments
Affects: Server-5.5
Status: Complete
For conversion between one Japanese character set and another Japanese character set, use a JIS table (based on JIS-X-0201 + JIS-X-0208 + JIS-X-0213 code points), or a single Unicode table, rather than Unicode tables and some 'if' statements. This approach will be faster.
Speed up conversions between Japanese character sets. Since ujis + eucjpms + sjis + cp932 are all based on JIS (Japanese Industrial Standard) repertoires, conversion should be possible with algorithms or JIS-table lookups, without requiring Unicode-table lookups. Alternatively, a single Unicode-table lookup is possible without requiring multiple "if" statements. This affects CAST(), CONVERT(), and any automatic conversion due to assignment. It does not mean you can compare sjis and cp932 strings without explicit conversion.

The preferred plan is in the section "Main Proposal: Big Unicode Table". The possible alternative plans are in the sections titled "Another Proposal ...". The implementor will pick the main proposal, having tested the other plans.

Pick Pairs
----------

The possible pairs are:

ujis to eucjpms
ujis to sjis
ujis to cp932
eucjpms to ujis
eucjpms to sjis
eucjpms to cp932
sjis to ujis
sjis to eucjpms
sjis to cp932
cp932 to ujis
cp932 to eucjpms
cp932 to sjis

This worklog description only has sections for ujis to sjis and sjis to cp932. However, given those pairs, the rest are straightforward. The implementor should make some effort to handle all pairs.

WL#1820 mentions 4 more Japanese character sets, and there might be 4 more due to JIS X 0213:2004. So in theory there could someday be 12 character sets and 144 possible pairs. However, we will make no effort here to allow for possible future character sets.

Requirement
-----------

For all characters, the results must be the same as the results we currently get for conversion via Unicode. And if a conversion is impossible, then there will be a warning or error, just as there is now in version 5.1.

This requirement can be discussed. If we have to bend it for the sake of efficiency, we need to know how bad that will be.

Main Proposal: Big Unicode Table
--------------------------------

With a JIS table we can avoid Unicode intermediary lookups, and thus save time. But there is another way to save time: make the Unicode intermediary lookups faster. The point is that there are many "if" statements in the Unicode lookups, because we only have mappings for certain characters (the other characters are either invalid or deducible). If we expanded the table so that it included all possible characters rather than just certain characters, we could eliminate the "if"s and do just one table-lookup statement.

Specifically: in functions like func_sjis_uni_onechar() or my_uni_jisx0208_onechar(), replace "if"s like these:

  ...
  if ((code>=0x00A1)&&(code<=0x00DF)) return(tab_cp932_uni0[code-0x00A1]);
  if ((code>=0x8140)&&(code<=0x84BE)) return(tab_cp932_uni1[code-0x8140]);
  if ((code>=0x8740)&&(code<=0x879C)) return(tab_cp932_uni2[code-0x8740]);
  ...

with a single "array operation" like this:

  ...
  code= tab_cp932_uni[code - something];
  ...

The implementor will test whether this proposal is at least as fast as the "Another Proposal" alternatives. If so, there will be no need to add mb_jis() or jis_mb() for each character set.

This will be a very big table when support is added for WL#1213 Supplementary Characters. From a performance point of view, much depends on how lucky we are with caching of this huge table. But the invalid SJIS/UJIS values should be very rare (indeed they shouldn't exist at all in a clean database), therefore they will never be looked up. Therefore, although the total table size is much larger if we allow for all invalid values, the amount that's actually used in lookups (and therefore the amount that's cached) is not larger at all.
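To make the single-lookup idea concrete, here is a minimal hedged sketch (not the actual implementation). The flat table name, the range macros, their values, and the use of 0 as a "no mapping" marker are all assumptions for illustration; the real bounds and contents would come from the existing tab_cp932_uni0, tab_cp932_uni1, ... range tables.

  #define CP932_FLAT_MIN 0x00A1   /* assumed lowest mapped code  */
  #define CP932_FLAT_MAX 0xFCFC   /* assumed highest mapped code */

  /* One flat array covering the whole range; unmapped slots hold 0.
     Filled once (at build time or startup) from the existing
     range tables. */
  static unsigned short tab_cp932_uni_flat[CP932_FLAT_MAX -
                                           CP932_FLAT_MIN + 1];

  static int
  func_cp932_uni_onechar_flat(int code)
  {
    /* One range check and one lookup replace the chain of "if"s. */
    if (code < CP932_FLAT_MIN || code > CP932_FLAT_MAX)
      return 0;
    return tab_cp932_uni_flat[code - CP932_FLAT_MIN];
  }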
So it seems to Peter that Alexander Barkov's "Big Unicode Table" proposal is always going to be faster with realistic data. (He's also assuming that invalid values are clumped together, rather than distributed evenly in the table, but that too seems realistic to him.)

Another Proposal: JIS table
---------------------------

Before 2010-01-04 this was the "Main Proposal".

The current loop in strings/ctype-*.c looks like this:

  while (!EOF)
  {
    cs1->cset->mb_wc(&code); /* scan Unicode character from src */
    cs2->cset->wc_mb(&code); /* put Unicode character to dst    */
  }

The proposed loop looks like this:

  while (!EOF)
  {
    cs1->cset->mb_jis(&code); /* scan JIS character from src */
    cs2->cset->jis_mb(&code); /* put JIS character to dst    */
  }

That is, each Japanese character set handler will have a new function mb_jis ("scan a character from an sjis/ujis/eucjpms/cp932 string and return its JIS code") and a new function jis_mb ("put a character with the given JIS code into an sjis/ujis/eucjpms/cp932 string").

The "JIS code" is a value defined by the various JIS standards. For example: 0xdf from JIS-X-0201, 0x2121 from JIS-X-0208, 0x???? from JIS-X-0213.

Since we avoid JIS-to-Unicode and Unicode-to-JIS table lookups, performance is about twice as fast, according to some early tests. The functions that should become faster are:

  sql/strfunc.cc    strconvert()
  sql/sql_string.cc copy_and_convert()
  sql/sql_string.cc well_formed_copy_nchars()

Another Proposal: JIS Table: Examples
-------------------------------------

For example, when you convert from sjis to ujis:

1a. my_mb_wc_sjis() scans an SJIS representation of a JIS-X-0208 code.
1b. my_mb_wc_sjis() converts the JIS-X-0208 code (in SJIS form) to Unicode using func_sjis_uni_onechar(), which is slow (uses table lookups). Then the found Unicode character code is returned.

then

2a. my_wc_mb_euc_jp() gets a Unicode code and converts it to JIS-X-0208 using my_uni_jisx0208_onechar(), which is slow (uses table lookups).
2b. my_wc_mb_euc_jp() puts the found JIS-X-0208 character into the result string.

The slowest steps here are func_sjis_uni_onechar() and my_uni_jisx0208_onechar(), i.e. conversion from JIS-X-0208 to Unicode, and then conversion from Unicode back to JIS-X-0208. If we use JIS-X-0208 instead of Unicode as the intermediary, then these two slow steps are not necessary.

An example file, jp.txt, attached to this worklog task, demonstrates what sjis_jis() and jis_cp932() could look like.

Another Proposal: Algorithm for ujis/eucjpms to sjis/cp932
----------------------------------------------------------

The algorithm for moving from an EUC encoding (ujis or eucjpms) to an SJIS encoding (sjis or cp932) is well known. There is a description in Wikipedia: http://en.wikipedia.org/wiki/Shift_JIS (a sketch of the conversion appears after the list of workarounds below). It's possible because, although the encodings are different, the underlying JIS "code points" are the same.

But the result can be an unassigned / reserved SJIS character, that is, well-formed but invalid. Example: _ujis 0xaaaa = JIS 0x2a2a = _sjis 0x85a8. The only ways around this difficulty are:

(1) Ignore it; assume that if the ujis value was acceptable then the sjis value must be good too.
(2) Use a lookup table with one bit for each JIS value, with 0 = valid and 1 = invalid, so the table is only 1/16 as large.
(3) Strip the UJIS value to get the JIS value, but then do a lookup from the JIS value to the SJIS value.
(4) Add more "if" statements, for example "if the first byte of the SJIS result is 0x85, it's bad".
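For concreteness, here is a hedged sketch of that well-known algorithm for one two-byte JIS-X-0208 character. The function name is hypothetical; the sketch deliberately omits JIS-X-0201 halfwidth kana (the SS2 prefix in ujis) and the validity check that the list above is about. It reproduces the example from the text (_ujis 0xaaaa = JIS 0x2a2a = _sjis 0x85a8).

  /* Convert one two-byte ujis character (bytes e1,e2 in 0xA1..0xFE)
     to sjis.  Hypothetical helper, for illustration only. */
  static void
  ujis_to_sjis_onechar(unsigned e1, unsigned e2,
                       unsigned *s1, unsigned *s2)
  {
    unsigned j1= e1 - 0x80;   /* JIS row byte,  0x21..0x7E */
    unsigned j2= e2 - 0x80;   /* JIS cell byte, 0x21..0x7E */

    *s1= (j1 + 1) / 2 + (j1 <= 0x5E ? 0x70 : 0xB0);
    if (j1 & 1)                                  /* odd JIS row  */
      *s2= j2 + (j2 <= 0x5F ? 0x1F : 0x20);
    else                                         /* even JIS row */
      *s2= j2 + 0x7E;
  }

  /* ujis 0xAA,0xAA: j1=j2=0x2A (even row), so s1=0x85, s2=0xA8,
     reproducing the "well-formed but invalid" example above. */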
The implementor will test to see whether the algorithm is faster than table lookup. If so, we will then have to choose one of the above "ways around this difficulty".

Another Proposal: sjis to cp932
-------------------------------

The idea here never became a clear proposal. We may remove this section after 2009-12-31.

Since cp932 is merely the Microsoft variant of sjis, many characters are the same in both character sets, and therefore need no conversion. For example, _sjis 0x8ec7 = _cp932 0x8ec7. Effectively the sjis-to-cp932 conversion can happen, for some character strings, by just renaming.

There are 4408 sjis characters which currently cannot be converted to cp932. This happens for three reasons:

1. The character is illegal in sjis. MySQL should never have accepted it. Bar suggested that MySQL should start rejecting such characters, but Peter resisted, saying that's a change in behaviour, a new task.

2. The character is legal in some sjis variant, but is not in the sjis-to-Unicode table.

3. The character is legal in sjis, and is in the sjis-to-Unicode table, but the sjis-to-Unicode value differs from the cp932-to-Unicode value. Example: 815F, 8160, 8161, 817C, 8191, 8192, 81CA. There are seven characters in this category; the differences are halfwidth versus fullwidth etc., and they are mentioned in the MySQL Reference Manual:
http://dev.mysql.com/doc/refman/5.1/en/charset-asian-sets.html
Specifically for 81CA (NOT SIGN) see the FAQ for the manual:
http://dev.mysql.com/doc/refman/5.1/en/faqs-cjk.html

We still need a table. But it can be only a small table, containing only the characters which cause conversion difficulty (a hedged sketch appears below, following the "Feedback" section).

There is a test for the 4408 characters which cannot be converted from sjis to cp932, in this email thread:
[ mysql intranet ] /secure/mailarchive/mail.php?folder=4&mail=30324

Results of Testing
------------------

In December 2009 Alexander Barkov wrote a program which compares the speeds of some of the proposals described above. The program and the results are attached to this worklog task as file attachment 'wl3090.tar.gz'. The tests do show that on some modern Linux machines the "Big Unicode Table" method is faster than the alternatives. So that is what we decide on.

Let's admit the following:

* Maybe one could get better results from the "[bit] Algorithm" method by working on it for several days. ... But it's not worth anyone's time.
* The original "Main Proposal: JIS table", which Mr Barkov did not test, would obviously be faster than "Big Unicode Table". ... But the JIS table is applicable to fewer conversion pairs.
* The "Big Unicode Table" has one disadvantage compared to the original, namely, it wastes lots of program data space. ... But we don't care.
* This is not what the original request from Japanese users looked like, and not absolutely the best sjis/cp932 converter. ... But Yoshinori Matsunobu has made no objection to that.

The main thing, the Bullet Point to use for justifying the work, is: Initial Tests Show New Method Should Be More Than 10% Faster Than Old Method. This can be a QA criterion.

Let's make it clear: the only thing we do for WL#3090, now and forever, is Big Unicode Table.

Feedback
--------

We asked for feedback re WL#3090 requirements several months ago. What we got (excluding comments from Dean + Susanne) was this:

"Yoshinori says: Must have the part for sjis-cp932 conversion"
[Bar says] "Not important. Postpone for a higher 6.x or 7.x"

and Shinobu Matsuzuka rated the task "P2" with no further comment.
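Here is the small-exception-table sketch promised in "Another Proposal: sjis to cp932" above. The struct, the function name, and the placeholder row are all illustrative assumptions; the real rows would be generated by comparing the sjis-to-Unicode and cp932-to-Unicode tables, and real code would still have to reject characters that fall under reasons 1 and 2.

  #include <stddef.h>

  /* Illustrative only: map the few problem sjis codes; pass
     everything else through unchanged. */
  struct sjis_cp932_exception
  {
    unsigned sjis;
    unsigned cp932;
  };

  static const struct sjis_cp932_exception sjis_cp932_exceptions[]=
  {
    /* { 0x815F, ... }, { 0x8160, ... }, etc.: placeholder rows,
       to be filled in from the mapping-table comparison */
    { 0x0000, 0x0000 }
  };

  static unsigned
  sjis_to_cp932_onechar(unsigned code)
  {
    size_t i;
    size_t n= sizeof(sjis_cp932_exceptions) /
              sizeof(sjis_cp932_exceptions[0]);
    for (i= 0; i < n; i++)
      if (sjis_cp932_exceptions[i].sjis == code)
        return sjis_cp932_exceptions[i].cp932;
    return code;           /* common case: same code in cp932 */
  }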
Cancelled Subtasks
------------------

This section is obsolete and may be removed. It concerns subtasks which were part of the original proposal, which we decided against for reasons given here. We may remove this section after 2009-12-31.

1. Change --skip-character-set-client-handshake using my.cnf.
CANCELLED. See progress notes for original description. Shuichi didn't get feedback from the Japanese community.

2. The following should be an error, not a warning:

mysql> create table tj (s1 char(10) character set sjis);
Query OK, 0 rows affected (0.52 sec)
mysql> set sql_mode=ansi;
Query OK, 0 rows affected (0.03 sec)
mysql> insert into tj values (0x8080);
Query OK, 1 row affected, 1 warning (0.00 sec)

(The above causes a warning 1366 "Incorrect string value ...", which is an error if strict.) What they really want is that we accept the junk character, but I can't think of a good way, and they indicated that "at least" an error should be there.
MOVED. WL#5083 Error for character set conversion failure.

4. Allow Shift_JIS as an alias for sjis. Allow EUC-JP as an alias for eucjp.
CANCELLED. See progress notes for original description. Shuichi didn't get feedback from the Japanese community.

References
----------

There was previous discussion of this worklog task in the email thread "Feedback and Requests from Japanese users community", with participants shuichi, pgulutzan, bar, and others. There is also a dev-private thread "Re: WL#3090 Japanese Character Set Adjustments".
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.