WL#4583: Case conversion in Asian character sets

Affects: Server-5.5   —   Status: Complete   —   Priority: Medium

This task is a part of Sun Globalization Requirements.

MySQL provides only partial support for case conversion
in Asian CJK (Chinese Japanese Korean) character sets.

Functions UPPER(), UCASE(), LOWER(), LCASE() work only
for the basic Latin letters: A-Z / a-z. They do not work
for other letters, including FULLWIDTH Latin A-z / a-z,
Latin with accents, Greek, or Cyrillic. For most such
letters, for affected character sets, UPPER|LOWER does nothing.

The Unicode character sets are not affected, UPPER|LOWER
functions work with them. So the objective is to make UPPER|LOWER
functions work the same way in CJK character sets as in Unicode
character sets, for every UCS2 character which can be
converted to a CJK character. (We assume that supplementary
characters are either not likely to be in the CJK character sets
or not likely to have both upper and lower forms.)

Affected character sets
-----------------------

Affected character sets are all Asian character sets
which currently contain 'chinese' or 'japanese' or 'korean'
in the description column of information_schema.character_sets.
Tier 1: UJIS SJIS GB2312 CP932 EUCJPMS
Tier 2: BIG5 EUCKR GBK

Other "Asian" character sets, for example tis620, are not affected.

Affected characters
-------------------

The script that fetches all affected characters for these eight character
sets and the script output can be found in the "File attachments" section
of this task.

In the descriptions below, the phrase "Only upper" means that the
listed characters appear only in upper case, and the phrase "Only lower"
means that the listed characters appear only in lower case.
For "Only upper" or "Only lower" characters there can be no
case conversion, therefore the behaviour of UPPER() and LOWER()
will work differently from Unicode for these characters.

Character Set Affected characters
big5          FULLWIDTH LATIN, GREEK, CYRILLIC
              Only upper:
                U+2160 to U+2169 ROMAN NUMERAL ONE to ROMAN NUMERAL TEN
              Only lower:
                U+0430 to U+0433 CYRILLIC SMALL LETTER A BE VE GHE
                U+043D to U+0442 CYRILLIC SMALL LETTER EN O PE ER ES TE

cp932         ROMAN NUMERAL, FULLWIDTH LATIN, GREEK, CYRILLIC
              Note: ROMAN NUMERAL and SMALL ROMAN NUMERAL
              have double encoding in cp932. We will map as follows:
              ROMAN NUMERAL 8754-875D to SMALL ROMAN NUMERAL EEEF-EEF8
              ROMAN NUMERAL FA4A-FA53 to SMALL ROMAN NUMERAL FA40-FA49
              Only upper:
                U+212B ANGSTROM SIGN

sjis          FULLWIDTH LATIN, GREEK, CYRILLIC
              Only upper:
                U+212B ANGSTROM SIGN

euckr         FULLWIDTH LATIN, GREEK, CYRILLIC, some extended Latin letters
              Only upper:
                U+212B ANGSTROM SIGN
              Only lower:
                U+0111 LATIN SMALL LETTER D WITH STROKE
                U+24D0 to U+24E9 CIRCLED LATIN SMALL LETTER A to Z

gb2312        FULLWIDTH LATIN, GREEK, CYRILLIC
              Only upper:
                U+2160 to U+2169 ROMAN NUMERAL ONE to ROMAN NUMERAL TEN
              Only lower:
                as in gbk (gb2312 is a subset of gbk)

gbk           FULLWIDTH LATIN, GREEK, CYRILLIC
              Only upper:
                U+216A ROMAN NUMERAL ELEVEN
                U+216B ROMAN NUMERAL TWELVE
              Only lower:
                U+00E0 LATIN SMALL LETTER A WITH GRAVE
                U+00E1 LATIN SMALL LETTER A WITH ACUTE
                U+00E8 LATIN SMALL LETTER E WITH GRAVE
                U+00E9 LATIN SMALL LETTER E WITH ACUTE
                U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
                U+00EC LATIN SMALL LETTER I WITH GRAVE
                U+00ED LATIN SMALL LETTER I WITH ACUTE
                U+00F2 LATIN SMALL LETTER O WITH GRAVE
                U+00F3 LATIN SMALL LETTER O WITH ACUTE
                U+00F9 LATIN SMALL LETTER U WITH GRAVE
                U+00FA LATIN SMALL LETTER U WITH ACUTE
                U+00FC LATIN SMALL LETTER U WITH DIAERESIS
                U+0101 LATIN SMALL LETTER A WITH MACRON
                U+0113 LATIN SMALL LETTER E WITH MACRON
                U+011B LATIN SMALL LETTER E WITH CARON
                U+012B LATIN SMALL LETTER I WITH MACRON
                U+0144 LATIN SMALL LETTER N WITH ACUTE
                U+0148 LATIN SMALL LETTER N WITH CARON
                U+014D LATIN SMALL LETTER O WITH MACRON
                U+016B LATIN SMALL LETTER U WITH MACRON
                U+01CE LATIN SMALL LETTER A WITH CARON
                U+01D0 LATIN SMALL LETTER I WITH CARON
                U+01D2 LATIN SMALL LETTER O WITH CARON
                U+01D4 LATIN SMALL LETTER U WITH CARON
                U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
                U+01D8 LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE
                U+01DA LATIN SMALL LETTER U WITH DIAERESIS AND CARON
                U+01DC LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE

eucjpms       ANGSTROM SIGN, FULLWIDTH LATIN, GREEK, CYRILLIC
              Only upper:
                U+0122 LATIN CAPITAL LETTER G WITH CEDILLA
              Only lower:
                U+00F0 LATIN SMALL LETTER ETH
                U+01F5 LATIN SMALL LETTER G WITH ACUTE

ujis          ANGSTROM SIGN, FULLWIDTH LATIN, GREEK, CYRILLIC
              Only upper:
                U+0122 LATIN CAPITAL LETTER G WITH CEDILLA
              Only lower:
                U+00F0 LATIN SMALL LETTER ETH
                U+01F5 LATIN SMALL LETTER G WITH ACUTE

Example
-------

For example, these expressions will return TRUE for all CP932 characters:

SELECT UPPER(cp932_expression) = CONVERT(UPPER(CONVERT(cp932_expression using
ucs2)) USING cp932); -> TRUE

SELECT LOWER(cp932_expression) = CONVERT(LOWER(CONVERT(cp932_expression using
ucs2)) USING cp932); -> TRUE

It works the same for all character sets listed in section "Affected
character sets".

Observations
------------

There will be no change with respect to case-insensitive comparisons or indexes.

There will be no upgrade/downgrade path. We do not expect that the effect
will break a significant number of existing applications. The current
behaviour, which is not documented, does not cause results which people
who are familiar with the characters would consider 'reasonable'.

Test
----

MySQL must pass the following test when the task is complete.

For every UCS2 character, for every Asian-language character set,
if the UCS2 character can be converted into an Asian-language-character-set
character, but (UPPER|LOWER(UCS2 character) converted to
Asian-language-character-set character is not equal to
UPPER|LOWER(UCS2 character converted to Asian-language-character-set-character),
then display a message showing the character set and the character.
If anything is displayed, the test fails.

[ Update 2009-11-27 ] The test procedure was changed to accommodate the
exceptional cases for 'Upper' and 'Lower' described in section Affected
Characters. ]

DELIMITER //
DROP PROCEDURE IF EXISTS p//
CREATE PROCEDURE p ()
BEGIN
  DECLARE ucs2_code_point INT;
  DECLARE eof INT DEFAULT 0;
  DECLARE v_character_set_name VARCHAR(32);
  DECLARE cs CURSOR FOR
    SELECT character_set_name
    FROM information_schema.character_sets
    WHERE description LIKE '%chinese%' OR description LIKE '%japanese%' OR
description LIKE '%korean%';
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET eof = 1;
  OPEN cs;
  WHILE eof = 0 DO
    FETCH cs INTO v_character_set_name;
    IF eof = 0 THEN
      DROP TABLE IF EXISTS t;
      SET @x = CONCAT('CREATE TABLE t (ucs2 CHAR(1) CHARACTER SET ucs2,
                       c CHAR(1), c_upper CHAR(1), c_lower CHAR(1))
                       CHARACTER SET ',
                       v_character_set_name);
      PREPARE stmt1 FROM @x;
      EXECUTE stmt1;
      INSERT INTO t (ucs2) VALUES (0x0000);
      SET ucs2_code_point = 0;
      WHILE ucs2_code_point <= 65535 DO
        SET @x = CONCAT('UPDATE t SET ucs2 = 0x',HEX(ucs2_code_point));
        PREPARE stmt1 FROM @x;
        EXECUTE stmt1;
        UPDATE t SET c = ucs2, c_upper=UPPER(ucs2), c_lower=LOWER(ucs2);
        SET @failure = 0;
        IF (SELECT COUNT(*) FROM t
            WHERE c = '?' OR (UPPER(c) = c_upper AND LOWER(c) = c_lower)) <> 1 THEN
          SET @failure = 1;
          END IF;
        IF v_character_set_name = 'big5' and HEX(ucs2_code_point) IN
        ('430','431','432','433','43d','43e','43f',
        '440','441','442','2160','2161','2162','2163','2164',
        '2165','2166','2167','2168','2169') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'ujis' AND HEX(ucs2_code_point) IN
        ('F0','122','1F5') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'sjis' AND HEX(ucs2_code_point) IN
        ('212B') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'euckr' AND HEX(ucs2_code_point) IN
        ('111','212B','24d0','24d1','24d2',
        '24d3','24d4','24d5','24d6','24d7',
        '24d8','24d9','24dA','24DB','24DC','24DD',
        '24DE','24DF','24E0','24E1','24E2',
        '24E3','24E4','24E5','24E6','24E7',
        '24E8','24E9') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'gb2312' AND HEX(ucs2_code_point) IN
        ('E0','E1','E8','E8','E9','EA',
        'EC','ED','F2','F3','F9',
        'FA','FC','101','113','11B',
        '12B','14D','16B',
        '1CE','1D0','1D2','1D4','1D6',
        '1D8','1DA','1DC',
        '2160','2161','2162','2163','2164',
        '2165','2166','2167','2168','2169',
        '216A','216B') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'gbk' AND HEX(ucs2_code_point) IN
        ('E0','E1','E8','E8','E9','EA',
        'EC','ED','F2','F3','F9',
        'FA','FC','101','113','11B',
        '12B','144','148','14D','16B',
        '1CE','1D0','1D2','1D4','1D6',
        '1D8','1DA','1DC','216A','216B') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'cp932' AND HEX(ucs2_code_point) IN
        ('212B','2160','2161','2162','2163','2164',
        '2165','2166','2167','2168','2169',
        '2170','2171','2172','2173','2174',
        '2175','2176','2177','2178','2179') THEN
          SET @failure = 0;
          END IF;
        IF v_character_set_name = 'eucjpms' and HEX(ucs2_code_point) IN
        ('F0','122','1F5','2170','2171','2172','2173','2174',
        '2175','2176','2177','2178','2179') THEN
          SET @failure = 0;
          END IF;
        IF @failure = 1 THEN
          SELECT 'Failure!', v_character_set_name,
                 HEX(ucs2), HEX(c), HEX(c_upper), HEX(c_lower),
                 HEX(UPPER(c)), HEX(LOWER(c))
                 FROM t;
          END IF;
        SET ucs2_code_point = ucs2_code_point + 1;
        END WHILE;
      END IF;
    END WHILE;
  CLOSE cs;
  END//
CALL p()//



Examples of current failures with MySQL 6.0:

+----------+----------------------+-----------+--------+--------------+
| Failure! | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) |
+----------+----------------------+-----------+--------+--------------+
| Failure! | big5                 | 044C      | C7E5   | C7C4         |
+----------+----------------------+-----------+--------+--------------+

--------------+---------------+---------------+
 HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
--------------+---------------+---------------+
 C7E5         | C7E5          | C7E5          |
--------------+---------------+---------------+

1 row in set (0.10 sec)

...

+----------+----------------------+-----------+--------+--------------+
| Failure! | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) |
+----------+----------------------+-----------+--------+--------------+
| Failure! | sjis                 | 0451      | 8476   | 8446         |
+----------+----------------------+-----------+--------+--------------+

--------------+---------------+---------------+
 HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
--------------+---------------+---------------+
 8476         | 8456          | 8476          |
--------------+---------------+---------------+

1 row in set (0.53 sec)


(The SJIS example may be considered to be evidence of a bug.)

The test can take many minutes.

References
----------

These sources were used when checking the 'Only upper' and
'Only lower' assertions in the Affected Characters section.
http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml