WL#3090: Japanese Character Set Adjustments
Affects: Server-5.5
Status: Complete
For conversion between one Japanese character set and another Japanese character set, use a JIS table (based on JIS-X-0201 + JIS-X-0208 + JIS-X-0213 code points), or a single Unicode table, rather than Unicode tables and some 'if' statements. This approach will be faster.
Speed up conversions between Japanese character sets. Since ujis + eucjpms + sjis + cp932 are all based on JIS (Japanese Industrial Standard) repertoires, conversion should be possible with algorithms or JIS-table lookups, without requiring Unicode-table lookups. Alternatively, a single Unicode-table lookup is possible without requiring multiple "if" statements. This affects CAST(), CONVERT(), and any automatic conversion due to assignment. It does not mean you can compare sjis and cp932 strings without explicit conversion.

The preferred plan is in the section "Main Proposal: Big Unicode Table". The possible alternative plans are in the sections titled "Another Proposal ...". The implementor will pick the main proposal, having tested the other plans.

Pick Pairs
----------

The possible pairs are:

ujis to eucjpms
ujis to sjis
ujis to cp932
eucjpms to ujis
eucjpms to sjis
eucjpms to cp932
sjis to ujis
sjis to eucjpms
sjis to cp932
cp932 to ujis
cp932 to eucjpms
cp932 to sjis

This worklog description only has sections for ujis to sjis and sjis to cp932. However, given those pairs, the rest are straightforward. The implementor should make some effort to handle all pairs.

WL#1820 mentions 4 more Japanese character sets, and there might be 4 more due to JIS X 0213:2004. So in theory there could someday be 12 character sets and 144 possible pairs. However, we will make no effort here to allow for possible future character sets.

Requirement
-----------

For all characters, the results must be the same as the results we currently get for conversion via Unicode. And if a conversion is impossible, then there will be a warning or error, just as there is now in version 5.1.

This requirement can be discussed. If we have to bend it for the sake of efficiency, we need to know how bad that will be.

Main Proposal: Big Unicode Table
--------------------------------

With a JIS table we can avoid Unicode intermediary lookups, and thus save time. But there is another way to save time: make the Unicode intermediary lookups faster. The point is that there are many "if" statements in the Unicode lookups, because we only have mappings for certain characters (the other characters are either invalid or deducible). If we expanded the table so that it included all possible characters rather than just certain characters, we could eliminate the "if"s and do just one table-lookup statement.

Specifically: in functions like func_sjis_uni_onechar() or my_uni_jisx0208_onechar(), replace "if"s like these:

  ...
  if ((code>=0x00A1)&&(code<=0x00DF)) return(tab_cp932_uni0[code-0x00A1]);
  if ((code>=0x8140)&&(code<=0x84BE)) return(tab_cp932_uni1[code-0x8140]);
  if ((code>=0x8740)&&(code<=0x879C)) return(tab_cp932_uni2[code-0x8740]);
  ...

with a single "array operation" like this:

  ...
  code= tab_cp932_uni[code - something];
  ...

The implementor will test whether this proposal is at least as fast as the "Another Proposal" alternatives. If so, there will be no need to add mb_jis() or jis_mb() for each character set.

This will be a very big table when support is added for WL#1213 Supplementary Characters. From a performance point of view, much depends on how lucky we are with caching of this huge table. But the invalid SJIS/UJIS values should be very rare (indeed they shouldn't exist at all in a clean database), therefore they will never be looked up. Therefore, although the total table size is much larger if we allow for all invalid values, the amount that's actually used in lookups (and therefore the amount that's cached) is not larger at all.
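To make the single-lookup idea concrete, here is a minimal hedged sketch (not the actual implementation). The flat table name, the range macros, their values, and the use of 0 as a "no mapping" marker are all assumptions for illustration; the real bounds and contents would come from the existing tab_cp932_uni0, tab_cp932_uni1, ... range tables.

  #define CP932_FLAT_MIN 0x00A1   /* assumed lowest mapped code  */
  #define CP932_FLAT_MAX 0xFCFC   /* assumed highest mapped code */

  /* One flat array covering the whole range; unmapped slots hold 0.
     Filled once (at build time or startup) from the existing
     range tables. */
  static unsigned short tab_cp932_uni_flat[CP932_FLAT_MAX -
                                           CP932_FLAT_MIN + 1];

  static int
  func_cp932_uni_onechar_flat(int code)
  {
    /* One range check and one lookup replace the chain of "if"s. */
    if (code < CP932_FLAT_MIN || code > CP932_FLAT_MAX)
      return 0;
    return tab_cp932_uni_flat[code - CP932_FLAT_MIN];
  }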
So it seems to Peter that Alexander Barkov's "Big Unicode Table" proposal is always going to be faster with realistic data. (He's also assuming that invalid values are clumped together, rather than distributed evenly in the table, but that too seems realistic to him.)

Another Proposal: JIS table
---------------------------

Before 2010-01-04 this was the "Main Proposal".

The current loop in strings/ctype-*.c looks like this:

  while (!EOF)
  {
    cs1->cset->mb_wc(&code); /* scan Unicode character from src */
    cs2->cset->wc_mb(&code); /* put Unicode character to dst    */
  }

The proposed loop looks like this:

  while (!EOF)
  {
    cs1->cset->mb_jis(&code); /* scan JIS character from src */
    cs2->cset->jis_mb(&code); /* put JIS character to dst    */
  }

That is, each Japanese character set handler will have a new function mb_jis ("scan a character from an sjis/ujis/eucjpms/cp932 string and return its JIS code") and a new function jis_mb ("put a character with the given JIS code into an sjis/ujis/eucjpms/cp932 string").

The "JIS code" is a value defined by the various JIS standards. For example: 0xdf from JIS-X-0201, 0x2121 from JIS-X-0208, 0x???? from JIS-X-0213.

Since we avoid JIS-to-Unicode and Unicode-to-JIS table lookups, performance is about twice as fast, according to some early tests. The functions that should become faster are:

  sql/strfunc.cc    strconvert()
  sql/sql_string.cc copy_and_convert()
  sql/sql_string.cc well_formed_copy_nchars()

Another Proposal: JIS Table: Examples
-------------------------------------

For example, when you convert from sjis to ujis:

1a. my_mb_wc_sjis() scans an SJIS representation of a JIS-X-0208 code.
1b. my_mb_wc_sjis() converts the JIS-X-0208 code (in SJIS form) to Unicode using func_sjis_uni_onechar(), which is slow (uses table lookups). Then the found Unicode character code is returned.

then

2a. my_wc_mb_euc_jp() gets a Unicode code and converts it to JIS-X-0208 using my_uni_jisx0208_onechar(), which is slow (uses table lookups).
2b. my_wc_mb_euc_jp() puts the found JIS-X-0208 character into the result string.

The slowest steps here are func_sjis_uni_onechar() and my_uni_jisx0208_onechar(), i.e. conversion from JIS-X-0208 to Unicode, and then conversion from Unicode back to JIS-X-0208. If we use JIS-X-0208 instead of Unicode as the intermediary, then these two slow steps are not necessary.

An example file, jp.txt, attached to this worklog task, demonstrates what sjis_jis() and jis_cp932() could look like.

Another Proposal: Algorithm for ujis/eucjpms to sjis/cp932
----------------------------------------------------------

The algorithm for moving from an EUC encoding (ujis or eucjpms) to an SJIS encoding (sjis or cp932) is well known. There is a description in Wikipedia: http://en.wikipedia.org/wiki/Shift_JIS (a sketch of the conversion appears after the list of workarounds below). It's possible because, although the encodings are different, the underlying JIS "code points" are the same.

But the result can be an unassigned / reserved SJIS character, that is, well-formed but invalid. Example: _ujis 0xaaaa = JIS 0x2a2a = _sjis 0x85a8. The only ways around this difficulty are:

(1) Ignore it; assume that if the ujis value was acceptable then the sjis value must be good too.
(2) Use a lookup table with one bit for each JIS value, with 0 = valid and 1 = invalid, so the table is only 1/16 as large.
(3) Strip the UJIS value to get the JIS value, but then do a lookup from the JIS value to the SJIS value.
(4) Add more "if" statements, for example "if the first byte of the SJIS result is 0x85, it's bad".
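For concreteness, here is a hedged sketch of that well-known algorithm for one two-byte JIS-X-0208 character. The function name is hypothetical; the sketch deliberately omits JIS-X-0201 halfwidth kana (the SS2 prefix in ujis) and the validity check that the list above is about. It reproduces the example from the text (_ujis 0xaaaa = JIS 0x2a2a = _sjis 0x85a8).

  /* Convert one two-byte ujis character (bytes e1,e2 in 0xA1..0xFE)
     to sjis.  Hypothetical helper, for illustration only. */
  static void
  ujis_to_sjis_onechar(unsigned e1, unsigned e2,
                       unsigned *s1, unsigned *s2)
  {
    unsigned j1= e1 - 0x80;   /* JIS row byte,  0x21..0x7E */
    unsigned j2= e2 - 0x80;   /* JIS cell byte, 0x21..0x7E */

    *s1= (j1 + 1) / 2 + (j1 <= 0x5E ? 0x70 : 0xB0);
    if (j1 & 1)                                  /* odd JIS row  */
      *s2= j2 + (j2 <= 0x5F ? 0x1F : 0x20);
    else                                         /* even JIS row */
      *s2= j2 + 0x7E;
  }

  /* ujis 0xAA,0xAA: j1=j2=0x2A (even row), so s1=0x85, s2=0xA8,
     reproducing the "well-formed but invalid" example above. */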
The implementor will test to see whether the algorithm is faster than table lookup. If so, we will then have to choose one of the above "ways around this difficulty".

Another Proposal: sjis to cp932
-------------------------------

The idea here never became a clear proposal. We may remove this section after 2009-12-31.

Since cp932 is merely the Microsoft variant of sjis, many characters are the same in both character sets, and therefore need no conversion. For example, _sjis 0x8ec7 = _cp932 0x8ec7. Effectively the sjis-to-cp932 conversion can happen, for some character strings, by just renaming.

There are 4408 sjis characters which currently cannot be converted to cp932. This happens for three reasons:

1. The character is illegal in sjis. MySQL should never have accepted it. Bar suggested that MySQL should start rejecting such characters, but Peter resisted, saying that's a change in behaviour, a new task.

2. The character is legal in some sjis variant, but is not in the sjis-to-Unicode table.

3. The character is legal in sjis, and is in the sjis-to-Unicode table, but the sjis-to-Unicode value differs from the cp932-to-Unicode value. Example: 815F, 8160, 8161, 817C, 8191, 8192, 81CA. There are seven characters in this category; the differences are halfwidth versus fullwidth etc., and they are mentioned in the MySQL Reference Manual:
http://dev.mysql.com/doc/refman/5.1/en/charset-asian-sets.html
Specifically for 81CA (NOT SIGN) see the FAQ for the manual:
http://dev.mysql.com/doc/refman/5.1/en/faqs-cjk.html

We still need a table. But it can be only a small table, containing only the characters which cause conversion difficulty (a hedged sketch appears below, following the "Feedback" section).

There is a test for the 4408 characters which cannot be converted from sjis to cp932, in this email thread:
[ mysql intranet ] /secure/mailarchive/mail.php?folder=4&mail=30324

Results of Testing
------------------

In December 2009 Alexander Barkov wrote a program which compares the speeds of some of the proposals described above. The program and the results are attached to this worklog task as file attachment 'wl3090.tar.gz'. The tests do show that on some modern Linux machines the "Big Unicode Table" method is faster than the alternatives. So that is what we decide on.

Let's admit the following:

* Maybe one could get better results from the "[bit] Algorithm" method by working on it for several days. ... But it's not worth anyone's time.
* The original "Main Proposal: JIS table", which Mr Barkov did not test, would obviously be faster than "Big Unicode Table". ... But the JIS table is applicable to fewer conversion pairs.
* The "Big Unicode Table" has one disadvantage compared to the original, namely, it wastes lots of program data space. ... But we don't care.
* This is not what the original request from Japanese users looked like, and not absolutely the best sjis/cp932 converter. ... But Yoshinori Matsunobu has made no objection to that.

The main thing, the Bullet Point to use for justifying the work, is: Initial Tests Show New Method Should Be More Than 10% Faster Than Old Method. This can be a QA criterion.

Let's make it clear: the only thing we do for WL#3090, now and forever, is Big Unicode Table.

Feedback
--------

We asked for feedback re WL#3090 requirements several months ago. What we got (excluding comments from Dean + Susanne) was this:

"Yoshinori says: Must have the part for sjis-cp932 conversion"
[Bar says] "Not important. Postpone for a higher 6.x or 7.x"

and Shinobu Matsuzuka rated the task "P2" with no further comment.
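Here is the small-exception-table sketch promised in "Another Proposal: sjis to cp932" above. The struct, the function name, and the placeholder row are all illustrative assumptions; the real rows would be generated by comparing the sjis-to-Unicode and cp932-to-Unicode tables, and real code would still have to reject characters that fall under reasons 1 and 2.

  #include <stddef.h>

  /* Illustrative only: map the few problem sjis codes; pass
     everything else through unchanged. */
  struct sjis_cp932_exception
  {
    unsigned sjis;
    unsigned cp932;
  };

  static const struct sjis_cp932_exception sjis_cp932_exceptions[]=
  {
    /* { 0x815F, ... }, { 0x8160, ... }, etc.: placeholder rows,
       to be filled in from the mapping-table comparison */
    { 0x0000, 0x0000 }
  };

  static unsigned
  sjis_to_cp932_onechar(unsigned code)
  {
    size_t i;
    size_t n= sizeof(sjis_cp932_exceptions) /
              sizeof(sjis_cp932_exceptions[0]);
    for (i= 0; i < n; i++)
      if (sjis_cp932_exceptions[i].sjis == code)
        return sjis_cp932_exceptions[i].cp932;
    return code;           /* common case: same code in cp932 */
  }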
Cancelled Subtasks
------------------

This section is obsolete and may be removed. It concerns subtasks which were part of the original proposal, which we decided against for reasons given here. We may remove this section after 2009-12-31.

1. Change --skip-character-set-client-handshake using my.cnf.
CANCELLED. See progress notes for original description. Shuichi didn't get feedback from the Japanese community.

2. The following should be an error, not a warning:

mysql> create table tj (s1 char(10) character set sjis);
Query OK, 0 rows affected (0.52 sec)
mysql> set sql_mode=ansi;
Query OK, 0 rows affected (0.03 sec)
mysql> insert into tj values (0x8080);
Query OK, 1 row affected, 1 warning (0.00 sec)

(The above causes a warning 1366 "Incorrect string value ...", which is an error if strict.) What they really want is that we accept the junk character, but I can't think of a good way, and they indicated that "at least" an error should be there.
MOVED. WL#5083 Error for character set conversion failure.

4. Allow Shift_JIS as an alias for sjis. Allow EUC-JP as an alias for eucjp.
CANCELLED. See progress notes for original description. Shuichi didn't get feedback from the Japanese community.

References
----------

There was previous discussion of this worklog task in the email thread "Feedback and Requests from Japanese users community", with participants shuichi, pgulutzan, bar, and others. There is also a dev-private thread "Re: WL#3090 Japanese Character Set Adjustments".
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.