WL#4583: Case conversion in Asian character sets
Affects: Server-5.5
—
Status: Complete
This task is a part of Sun Globalization Requirements. MySQL provides only partial support for case conversion in Asian CJK (Chinese Japanese Korean) character sets. Functions UPPER(), UCASE(), LOWER(), LCASE() work only for the basic Latin letters: A-Z / a-z. They do not work for other letters, including FULLWIDTH Latin A-z / a-z, Latin with accents, Greek, or Cyrillic. For most such letters, for affected character sets, UPPER|LOWER does nothing. The Unicode character sets are not affected, UPPER|LOWER functions work with them. So the objective is to make UPPER|LOWER functions work the same way in CJK character sets as in Unicode character sets, for every UCS2 character which can be converted to a CJK character. (We assume that supplementary characters are either not likely to be in the CJK character sets or not likely to have both upper and lower forms.) Affected character sets ----------------------- Affected character sets are all Asian character sets which currently contain 'chinese' or 'japanese' or 'korean' in the description column of information_schema.character_sets. Tier 1: UJIS SJIS GB2312 CP932 EUCJPMS Tier 2: BIG5 EUCKR GBK Other "Asian" character sets, for example tis620, are not affected. Affected characters ------------------- The script that fetches all affected characters for these eight character sets and the script output can be found in the "File attachments" section of this task. In the descriptions below, the phrase "Only upper" means that the listed characters appear only in upper case, and the phrase "Only lower" means that the listed characters appear only in lower case. For "Only upper" or "Only lower" characters there can be no case conversion, therefore the behaviour of UPPER() and LOWER() will work differently from Unicode for these characters. Character Set Affected characters big5 FULLWIDTH LATIN, GREEK, CYRILLIC Only upper: U+2160 to U+2169 ROMAN NUMERAL ONE to ROMAN NUMERAL TEN Only lower: U+0430 to U+0433 CYRILLIC SMALL LETTER A BE VE GHE U+043D to U+0442 CYRILLIC SMALL LETTER EN O PE ER ES TE cp932 ROMAN NUMERAL, FULLWIDTH LATIN, GREEK, CYRILLIC Note: ROMAN NUMERAL and SMALL ROMAN NUMERAL have double encoding in cp932. We will map as follows: ROMAN NUMERAL 8754-875D to SMALL ROMAN NUMERAL EEEF-EEF8 ROMAN NUMERAL FA4A-FA53 to SMALL ROMAN NUMERAL FA40-FA49 Only upper: U+212B ANGSTROM SIGN sjis FULLWIDTH LATIN, GREEK, CYRILLIC Only upper: U+212B ANGSTROM SIGN euckr FULLWIDTH LATIN, GREEK, CYRILLIC, some extended Latin letters Only upper: U+212B ANGSTROM SIGN Only lower: U+0111 LATIN SMALL LETTER D WITH STROKE U+24D0 to U+24E9 CIRCLED LATIN SMALL LETTER A to Z gb2312 FULLWIDTH LATIN, GREEK, CYRILLIC Only upper: U+2160 to U+2169 ROMAN NUMERAL ONE to ROMAN NUMERAL TEN Only lower: as in gbk (gb2312 is a subset of gbk) gbk FULLWIDTH LATIN, GREEK, CYRILLIC Only upper: U+216A ROMAN NUMERAL ELEVEN U+216B ROMAN NUMERAL TWELVE Only lower: U+00E0 LATIN SMALL LETTER A WITH GRAVE U+00E1 LATIN SMALL LETTER A WITH ACUTE U+00E8 LATIN SMALL LETTER E WITH GRAVE U+00E9 LATIN SMALL LETTER E WITH ACUTE U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX U+00EC LATIN SMALL LETTER I WITH GRAVE U+00ED LATIN SMALL LETTER I WITH ACUTE U+00F2 LATIN SMALL LETTER O WITH GRAVE U+00F3 LATIN SMALL LETTER O WITH ACUTE U+00F9 LATIN SMALL LETTER U WITH GRAVE U+00FA LATIN SMALL LETTER U WITH ACUTE U+00FC LATIN SMALL LETTER U WITH DIAERESIS U+0101 LATIN SMALL LETTER A WITH MACRON U+0113 LATIN SMALL LETTER E WITH MACRON U+011B LATIN SMALL LETTER E WITH CARON U+012B LATIN SMALL LETTER I WITH MACRON U+0144 LATIN SMALL LETTER N WITH ACUTE U+0148 LATIN SMALL LETTER N WITH CARON U+014D LATIN SMALL LETTER O WITH MACRON U+016B LATIN SMALL LETTER U WITH MACRON U+01CE LATIN SMALL LETTER A WITH CARON U+01D0 LATIN SMALL LETTER I WITH CARON U+01D2 LATIN SMALL LETTER O WITH CARON U+01D4 LATIN SMALL LETTER U WITH CARON U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON U+01D8 LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE U+01DA LATIN SMALL LETTER U WITH DIAERESIS AND CARON U+01DC LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE eucjpms ANGSTROM SIGN, FULLWIDTH LATIN, GREEK, CYRILLIC Only upper: U+0122 LATIN CAPITAL LETTER G WITH CEDILLA Only lower: U+00F0 LATIN SMALL LETTER ETH U+01F5 LATIN SMALL LETTER G WITH ACUTE ujis ANGSTROM SIGN, FULLWIDTH LATIN, GREEK, CYRILLIC Only upper: U+0122 LATIN CAPITAL LETTER G WITH CEDILLA Only lower: U+00F0 LATIN SMALL LETTER ETH U+01F5 LATIN SMALL LETTER G WITH ACUTE Example ------- For example, these expressions will return TRUE for all CP932 characters: SELECT UPPER(cp932_expression) = CONVERT(UPPER(CONVERT(cp932_expression using ucs2)) USING cp932); -> TRUE SELECT LOWER(cp932_expression) = CONVERT(LOWER(CONVERT(cp932_expression using ucs2)) USING cp932); -> TRUE It works the same for all character sets listed in section "Affected character sets". Observations ------------ There will be no change with respect to case-insensitive comparisons or indexes. There will be no upgrade/downgrade path. We do not expect that the effect will break a significant number of existing applications. The current behaviour, which is not documented, does not cause results which people who are familiar with the characters would consider 'reasonable'. Test ---- MySQL must pass the following test when the task is complete. For every UCS2 character, for every Asian-language character set, if the UCS2 character can be converted into an Asian-language-character-set character, but (UPPER|LOWER(UCS2 character) converted to Asian-language-character-set character is not equal to UPPER|LOWER(UCS2 character converted to Asian-language-character-set-character), then display a message showing the character set and the character. If anything is displayed, the test fails. [ Update 2009-11-27 ] The test procedure was changed to accommodate the exceptional cases for 'Upper' and 'Lower' described in section Affected Characters. ] DELIMITER // DROP PROCEDURE IF EXISTS p// CREATE PROCEDURE p () BEGIN DECLARE ucs2_code_point INT; DECLARE eof INT DEFAULT 0; DECLARE v_character_set_name VARCHAR(32); DECLARE cs CURSOR FOR SELECT character_set_name FROM information_schema.character_sets WHERE description LIKE '%chinese%' OR description LIKE '%japanese%' OR description LIKE '%korean%'; DECLARE CONTINUE HANDLER FOR NOT FOUND SET eof = 1; OPEN cs; WHILE eof = 0 DO FETCH cs INTO v_character_set_name; IF eof = 0 THEN DROP TABLE IF EXISTS t; SET @x = CONCAT('CREATE TABLE t (ucs2 CHAR(1) CHARACTER SET ucs2, c CHAR(1), c_upper CHAR(1), c_lower CHAR(1)) CHARACTER SET ', v_character_set_name); PREPARE stmt1 FROM @x; EXECUTE stmt1; INSERT INTO t (ucs2) VALUES (0x0000); SET ucs2_code_point = 0; WHILE ucs2_code_point <= 65535 DO SET @x = CONCAT('UPDATE t SET ucs2 = 0x',HEX(ucs2_code_point)); PREPARE stmt1 FROM @x; EXECUTE stmt1; UPDATE t SET c = ucs2, c_upper=UPPER(ucs2), c_lower=LOWER(ucs2); SET @failure = 0; IF (SELECT COUNT(*) FROM t WHERE c = '?' OR (UPPER(c) = c_upper AND LOWER(c) = c_lower)) <> 1 THEN SET @failure = 1; END IF; IF v_character_set_name = 'big5' and HEX(ucs2_code_point) IN ('430','431','432','433','43d','43e','43f', '440','441','442','2160','2161','2162','2163','2164', '2165','2166','2167','2168','2169') THEN SET @failure = 0; END IF; IF v_character_set_name = 'ujis' AND HEX(ucs2_code_point) IN ('F0','122','1F5') THEN SET @failure = 0; END IF; IF v_character_set_name = 'sjis' AND HEX(ucs2_code_point) IN ('212B') THEN SET @failure = 0; END IF; IF v_character_set_name = 'euckr' AND HEX(ucs2_code_point) IN ('111','212B','24d0','24d1','24d2', '24d3','24d4','24d5','24d6','24d7', '24d8','24d9','24dA','24DB','24DC','24DD', '24DE','24DF','24E0','24E1','24E2', '24E3','24E4','24E5','24E6','24E7', '24E8','24E9') THEN SET @failure = 0; END IF; IF v_character_set_name = 'gb2312' AND HEX(ucs2_code_point) IN ('E0','E1','E8','E8','E9','EA', 'EC','ED','F2','F3','F9', 'FA','FC','101','113','11B', '12B','14D','16B', '1CE','1D0','1D2','1D4','1D6', '1D8','1DA','1DC', '2160','2161','2162','2163','2164', '2165','2166','2167','2168','2169', '216A','216B') THEN SET @failure = 0; END IF; IF v_character_set_name = 'gbk' AND HEX(ucs2_code_point) IN ('E0','E1','E8','E8','E9','EA', 'EC','ED','F2','F3','F9', 'FA','FC','101','113','11B', '12B','144','148','14D','16B', '1CE','1D0','1D2','1D4','1D6', '1D8','1DA','1DC','216A','216B') THEN SET @failure = 0; END IF; IF v_character_set_name = 'cp932' AND HEX(ucs2_code_point) IN ('212B','2160','2161','2162','2163','2164', '2165','2166','2167','2168','2169', '2170','2171','2172','2173','2174', '2175','2176','2177','2178','2179') THEN SET @failure = 0; END IF; IF v_character_set_name = 'eucjpms' and HEX(ucs2_code_point) IN ('F0','122','1F5','2170','2171','2172','2173','2174', '2175','2176','2177','2178','2179') THEN SET @failure = 0; END IF; IF @failure = 1 THEN SELECT 'Failure!', v_character_set_name, HEX(ucs2), HEX(c), HEX(c_upper), HEX(c_lower), HEX(UPPER(c)), HEX(LOWER(c)) FROM t; END IF; SET ucs2_code_point = ucs2_code_point + 1; END WHILE; END IF; END WHILE; CLOSE cs; END// CALL p()// Examples of current failures with MySQL 6.0: +----------+----------------------+-----------+--------+--------------+ | Failure! | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | +----------+----------------------+-----------+--------+--------------+ | Failure! | big5 | 044C | C7E5 | C7C4 | +----------+----------------------+-----------+--------+--------------+ --------------+---------------+---------------+ HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) | --------------+---------------+---------------+ C7E5 | C7E5 | C7E5 | --------------+---------------+---------------+ 1 row in set (0.10 sec) ... +----------+----------------------+-----------+--------+--------------+ | Failure! | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | +----------+----------------------+-----------+--------+--------------+ | Failure! | sjis | 0451 | 8476 | 8446 | +----------+----------------------+-----------+--------+--------------+ --------------+---------------+---------------+ HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) | --------------+---------------+---------------+ 8476 | 8456 | 8476 | --------------+---------------+---------------+ 1 row in set (0.53 sec) (The SJIS example may be considered to be evidence of a bug.) The test can take many minutes. References ---------- These sources were used when checking the 'Only upper' and 'Only lower' assertions in the Affected Characters section. http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.