WL#4024: gb18030 Chinese character set
Affects: Server-5.7
Status: Complete
In 2000, mainland China introduced a new character set: gb18030, "Chinese National Standard GB 18030-2000: Information Technology -- Chinese ideograms coded character set for information interchange -- Extension for the basic set". It was revised in 2005 and is now GB 18030-2005. It supersedes the two Chinese character sets that MySQL already supports, gb2312 and gbk.

Since gb18030 is upward compatible with gb2312 and gbk, and since the new characters in gb18030 are rare, it has been possible to use gb2312 or gbk when the true target is gb18030. But the Chinese government does not approve of that practice, and it causes problems for users.

A prerequisite: MySQL must support supplementary Unicode characters, as described in WL#1213 "Implement 4-byte UTF8, UTF16 and UTF32". This was implemented in MySQL-6.0.

The MySQL character set name will be gb18030. gb18030 will support UPPER/LOWER conversion for all letters, including Latin extended characters and non-Latin letters (Greek, Cyrillic, etc).

Collation names will be:
- gb18030_bin
- gb18030_chinese_ci
  This collation will sort according to the UPPER map, i.e. using UPPER(letter) as the weight for that letter.

References
----------
Wikipedia article "GB 18030"
http://en.wikipedia.org/wiki/GB_18030
Functional requirements:

F-1: The character set shall be a compiled one.

F-2: gb18030 shall work in the same way as other character sets. It shall be available in SQL and commands such as 'SET NAMES', '_gb18030', 'CHARACTER SET gb18030', etc. The same applies to the collations gb18030 supports.

F-3: Conversion between gb18030 and other character sets (utf8, etc.) shall be supported, so statements like 'CONVERT(xxx USING gb18030)' shall work.

F-4: All code points defined in the gb18030 standard shall be supported. All of Unicode [U+0000, U+10FFFF], excluding [U+D800, U+E000), maps to the following gb18030 ranges, and vice versa:
- the whole 1-byte range
- the whole 2-byte range
- the 4-byte range [GB+81308130, GB+E3329A35]
Although the code points in the ranges (GB+8431A439, GB+90308130) and [GB+E3329A36, GB+FE39FE39] are not assigned gb18030 code points, they shall be supported as well: they shall be treated as '?' (0x3F), and any conversion of them shall result in '?' too.

F-5: The Unicode private use area [U+E000, U+F8FF] shall be mapped to gb18030 too. The gb18030 code points can be 4 or 2 bytes: for example, U+E000 maps to GB+AAA1 and U+F8FF maps to GB+84308130.

F-6: There is no mapping between [U+D800, U+DFFF] (the surrogate area) and gb18030. An attempted conversion should raise an error, but for now it shall result in '?', because the mysql client accepts _ucs2 0xD800 etc.

F-7: If an incoming gb18030 sequence is illegal, the character set shall raise an error or give a warning: if the illegal sequence is used in CONVERT, an error shall be raised; when it is used in INSERT etc., a warning shall be given.

F-8: UPPER/LOWER shall work for every gb18030 code point. The character set shall support all the case folding defined by Unicode (see the reference section of the HLD; all 1-to-1 mappings are defined in UnicodeData).

F-9: To keep consistency with utf8/utf8mb4, we shall not support UPPER for ligatures. So although U+FB03 is a ligature, we still get UPPER(gb18030(U+FB03)) = gb18030(U+FB03).

F-10: Searches for ligatures shall match their upper cases too. If we search for 'FFI', 'ffi' or CONCAT(U+FB00, 'I'), then U+FB03 shall be a match, since the UPPER of all of them is 'FFI'. This is only supported in the gb18030_unicode_520_ci collation.

F-11: If a character has more than one upper case, we shall choose the one whose lower case is the character itself. The others may be ligatures and the like, which have no lower case.

F-12: The character set shall support both the gb18030_bin and gb18030_chinese_ci collations. gb18030_bin shall act like all other bin collations.

F-13: gb18030_chinese_ci shall sort all non-Chinese characters according to the UPPER map and the code points' code values. The rules of comparison are:
- convert every code point to UPPER before comparison
- 4-byte code points are bigger than 2-byte ones
- 2-byte code points are bigger than 1-byte ones
- e.g. 'A' = 'a', 0x45 = 0x65 < 0x66 < 0x7F < 0xAEA1 < 0xFEFE < 0x81308130 < 0x81308131, etc.
Chinese characters are sorted according to the PINYIN collation defined in CLDR 24, and all Chinese characters are bigger than all non-Chinese characters, except for GB+FE39FE39, which always has the biggest weight. So the sorting sequence shall be: non-Chinese characters except GB+FE39FE39, sorted according to UPPER() < Chinese characters, sorted according to PINYIN < GB+FE39FE39.

F-14: The character set shall support comparison between two gb18030 strings, or between gb18030 and another character set. A conversion takes place if the two strings are in different character sets (either from gb18030 to the other, or from the other to gb18030). Both comparison that ignores trailing spaces and comparison that does not shall be supported.

F-15: The maximum and minimum multi-byte lengths shall be 4 and 1. The character set shall determine the length of a sequence from its leading one or two bytes. For example, 0x61 can be determined to be a one-byte gb18030 character from the first byte alone, while 0x81 is ambiguous; with two leading bytes, sequences such as 0x8138 or 0x8150 become decidable (see the sketch after these requirements).

Non-functional requirements:

NF-1: Conversion/comparison between gb18030 and other character sets may be a little more expensive (gb18030 is a multi-byte character set whose characters can be 1, 2 or 4 bytes long, so the implementation is more complicated than for other character sets), but this shall not reduce server throughput.
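A minimal sketch of the length rule in F-15, assuming the coding structure given in the next section; the function name is illustrative, not the server's actual implementation:

  #include <stddef.h>

  /*
    Sketch: number of bytes in the gb18030 character whose first byte is
    `b0`, consulting `b1` when the first byte is not decisive.
    Returns 1, 2 or 4 on success, 0 for an illegal head sequence.
    Coding structure:
      1-byte: 00-7F
      2-byte: 81-FE | 40-7E, 80-FE
      4-byte: 81-FE | 30-39 | 81-FE | 30-39
  */
  static size_t gb18030_char_length(unsigned char b0, unsigned char b1)
  {
    if (b0 <= 0x7F)
      return 1;                /* single byte, decided by b0 alone */
    if (b0 == 0x80 || b0 == 0xFF)
      return 0;                /* not valid leading bytes */
    /* b0 is in 81-FE: the second byte must be examined */
    if ((b1 >= 0x40 && b1 <= 0x7E) || (b1 >= 0x80 && b1 <= 0xFE))
      return 2;
    if (b1 >= 0x30 && b1 <= 0x39)
      return 4;                /* trailing two bytes still need validation */
    return 0;                  /* illegal second byte */
  }

With this, gb18030_char_length(0x61, 0) == 1, gb18030_char_length(0x81, 0x38) == 4 and gb18030_char_length(0x81, 0x50) == 2, matching the examples in F-15.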
Relations between GB18030-2000 and GB18030-2005
-----------------------------------------------
- Their coding structures are the same; GB18030-2005 is a replacement (extension) of GB18030-2000
- The characters of CJK Unified Ideographs Extension B are included in GB18030-2005
- A few code points of GB18030-2000 are adjusted in GB18030-2005
- GB18030-2000 is a mandatory standard; the extensions in GB18030-2005 (the two items above) are optional (recommended)

GB18030-2005 (GB18030 for short) is therefore the version that should be supported in MySQL.

Coding structure of GB18030
---------------------------
A multi-byte encoding using 1-byte, 2-byte and 4-byte codes:
- Single-byte: 00-7F
- Two-byte:    81-FE | 40-7E, 80-FE
- Four-byte:   81-FE | 30-39 | 81-FE | 30-39

Features and challenges
-----------------------
- It is not possible for all byte sequences to determine the length of the sequence from the first byte; the second byte must be examined as well (a significant difference from other codepages)
- The code page of GB18030 is huge: there are more than 1.6 million valid byte sequences
- It is similar to a UTF: all 1.1 million Unicode code points U+0000-U+10FFFF, except for the surrogates U+D800-U+DFFF, map to and from GB 18030 codes
- Random access in GB 18030 text is even more difficult than in other multi-byte codepages

Mapping table
-------------
Most of these mappings, except for parts of the BMP (U+0000 ~ U+FFFF), can be done algorithmically:
- Single-byte (128 code points): 00-7F are mapped to Unicode U+0000-U+007F
- Two-byte (23940 code points) and four-byte less than 0x90308130 (39420 code points) are mapped to Unicode U+0080-U+FFFF, except U+D800-U+DFFF
- Four-byte 0x90308130 ~ 0xE3329A35 are mapped to Unicode U+10000-U+10FFFF

The mapping table for Unicode U+0080-U+FFFF must be explicitly defined, except for some large ranges, including:
- U+0452-U+1E3E to GB+8130D330-GB+8135F436, 6637 code points
- U+2643-U+2E80 to GB+8137A839-GB+8138FD38, 2110 code points
- U+9FA6-U+D7FF to GB+82358F33-GB+8336C738, 14426 code points
- U+E865-U+F92B to GB+8336D030-GB+84308534, 4295 code points
- U+FA2A-U+FE2F to GB+84309C38-GB+84318537, 1030 code points
Other ranges, each covering fewer than 1000 code points, are ignored here.

Mapping from GB18030 to Unicode
- Use the mapping table, except for those ranges
- For two-byte GB18030 code points, the mapping table maps the GB encoding to the Unicode encoding
- For four-byte GB18030 code points, the mapping table maps the difference between the code point and GB+81308130 to Unicode
- Example: GB+81308131 is mapped to U+0081, so the mapping relation is 0x01 -> U+0081; GB+81308230 is mapped to U+008A, so the mapping relation is 0x0A -> U+008A; etc.

Mapping from Unicode to GB18030
- All Unicode code points are mapped to an encoding defined as below:
  1. Two-byte GB18030 code points, whose leading byte is greater than or equal to 0x81
  2. Two-byte differences, whose leading byte is less than 0x81
     2.1 For Unicode less than or equal to U+D7FF, the difference is between the GB code point value and GB+81308130; in fact the mapping of the range U+9FA6 to U+D7FF can be done algorithmically
     2.2 For Unicode greater than or equal to U+E000 but less than or equal to U+FFFF, the difference is the difference between the GB code point value and GB+81308130, minus 7456, which ensures the leading byte of the final difference (guaranteed to be > 0) is less than 0x81
- An exception is the range [U+D800, U+DFFF], which is reserved for internal use by UTF-16. GB18030 has no mappings for these code points, and input in this range is treated as an illegal sequence. (A sketch of the algorithmic four-byte mapping follows.)
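To make the algorithmic part concrete, here is a sketch (helper names are illustrative, not from the server source) of the linear "difference" of a 4-byte code point and of the purely algorithmic supplementary-plane mapping:

  #include <stdint.h>

  /*
    Sketch: linear index ("difference") of a 4-byte gb18030 code point,
    counted from GB+81308130.  Bytes b0,b2 are in 81-FE and b1,b3 in
    30-39, so the radices are 126, 10, 126, 10.
  */
  static uint32_t gb18030_4byte_diff(unsigned char b0, unsigned char b1,
                                     unsigned char b2, unsigned char b3)
  {
    return (((uint32_t)(b0 - 0x81) * 10 + (b1 - 0x30)) * 126
            + (b2 - 0x81)) * 10 + (b3 - 0x30);
  }

  /* Difference of GB+90308130: (0x90 - 0x81) * 10 * 126 * 10 = 189000 */
  #define DIFF_GB_90308130 189000u

  /*
    Supplementary plane: GB+90308130 ~ GB+E3329A35 map linearly onto
    U+10000 ~ U+10FFFF, so no table lookup is needed for this range.
  */
  static uint32_t gb18030_4byte_to_supplementary(unsigned char b0,
                                                 unsigned char b1,
                                                 unsigned char b2,
                                                 unsigned char b3)
  {
    return 0x10000 + gb18030_4byte_diff(b0, b1, b2, b3) - DIFF_GB_90308130;
  }

For example, gb18030_4byte_diff(0x81, 0x30, 0x81, 0x31) is 1 and gb18030_4byte_diff(0x81, 0x30, 0x82, 0x30) is 10 (0x0A), matching the examples above; gb18030_4byte_diff(0xE3, 0x32, 0x9A, 0x35) - 189000 is 0xFFFFF, so GB+E3329A35 lands exactly on U+10FFFF.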
Currently we do support those unused 4-byte code points which have no corresponding code points in Unicode, namely (GB+8431A439, GB+90308130) and (GB+E3329A35, GB+FE39FE39). These are treated as '?' (0x3F).

Unicode PUA
-----------
There is a full mapping between gb18030 and Unicode, so all PUA code points are mapped to gb18030 as well, and vice versa. gbk maps 2149 code points to the Unicode PUA [U+E000, U+E864]; 80 of them are characters that Unicode had not defined at the time, while the others are unassigned code points in gbk. gb18030 inherits the mappings for these 80 characters. These 80 characters have since been moved to non-PUA code points in later Unicode versions, so the characters defined in non-PUA can be used instead of the PUA ones; gb18030 also includes those mappings between the characters and non-PUA code points.

For example, GB+FE64 and GB+8336CA39 are different gb18030 code points: the former maps to U+3A73, while the latter maps to U+E829, which is in the PUA and has no definition in Unicode. If the user needs to treat them as equal, that may be done in a user-defined collation or with a conversion of some kind. That means our gb18030 supports the mappings between gb18030 code points and Unicode code points, rather than mappings between characters and gb18030 code points. In fact, gb18030 defines a full mapping between gb18030 code points and the PUA [U+E000, U+F8FF], but most of them are unassigned code points in gb18030.

Unicase
-------
Since gb18030 supports all code points in Unicode, all letters that have an UPPER/LOWER form must be taken into account. All possible code points for these letters are collected in the CaseFolding file published by Unicode. Since gb18030 is a variable-length encoding whose characters may be 4 bytes, it is nearly impossible to store the {UPPER, LOWER, SORT} tuples for all case-sensitive code points directly in a mapping array, as we do for gbk, utf8, etc. But we can do it the way we do for the mapping tables between gb18030 and Unicode:
- {UPPER, LOWER, SORT} is used for all the code points; UPPER/LOWER are encoded as gb18030 and SORT is encoded as Unicode
- For every UPPER or LOWER, if it is:
  * 1-byte (0x00 - 0x7F): keep it as is (or use the to_lower/to_upper arrays)
  * 2-byte (0xA200 - 0xA2FF, 0xA3xx, 0xA6xx, 0xA7xx): keep it as is
  * 4-byte: calculate the difference between the code point encoding and GB+81308130; for GB+81308231 the difference is 1*10 + 1 = 11 = 0x0B. Nearly all differences are less than 0x9600; the only exceptions are some differences greater than 0x2E600 and less than 0x2E6FF. This region does not conflict with the 2-byte area, but it would conflict with the 1-byte area, because some LOWER values would be less than 0x0080. So the differences less than 0x9600 have 0x80 added to avoid the conflict, and the region 0x2E600 - 0x2E6FF is regarded as 0xE600 - 0xE6FF.
- If a tuple is defined, such as {0xA2FB, 0x20B0, 0x216A}, all the code points with the same leading byte (in this case 0xA2) must be defined too, so the region will be 0xA200-0xA2FF
- All the possible regions (21 in total) will be:

    [0x0000, 0x00FF]   /* the actual code points of 1-byte gb18030 encoding */
    ----------------
    [0x0100, 0x01FF]   /* the differences between the GB+ encoding and
    [0x0200, 0x02FF]      GB+81308130, with 0x80 added */
    [0x0300, 0x03FF]
    [0x0400, 0x04FF]
    [0x1000, 0x10FF]
    [0x1D00, 0x1DFF]
    [0x1E00, 0x1EFF]
    [0x1F00, 0x1FFF]
    [0x2000, 0x20FF]
    [0x2300, 0x23FF]
    [0x2A00, 0x2AFF]
    [0x2B00, 0x2BFF]
    [0x5100, 0x51FF]
    [0x5200, 0x52FF]
    ----------------
    [0xA200, 0xA2FF]   /* the actual code points of 2-byte gb18030 encoding */
    [0xA300, 0xA3FF]
    [0xA600, 0xA6FF]
    [0xA700, 0xA7FF]
    [0xA800, 0xA8FF]
    ----------------
    [0xE600, 0xE6FF]   /* 0x2E600-0x2E6FF are the true differences for the
                          4-byte encoding */

- Finally, an array of length 256 can be defined as the MY_UNICASE_INFO for gb18030
(A sketch of the key computation follows.)
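A minimal sketch of the key scheme above; the function is hypothetical and assumes the 4-byte difference has already been computed as in the mapping section:

  #include <stdint.h>

  /*
    Sketch: fold a gb18030 code point into the 16-bit key space indexed
    by the 21 regions above.  `gb` holds the code point as an integer
    (e.g. 0x61, 0xA2FB, or 0x81308231 for a 4-byte one); `diff` is the
    precomputed difference from GB+81308130 for 4-byte code points.
  */
  static uint16_t gb18030_unicase_key(uint32_t gb, uint32_t diff)
  {
    if (gb <= 0xFFFF)
      return (uint16_t) gb;           /* 1- and 2-byte: keep as is */
    /* 4-byte: shift small differences by 0x80 past the 1-byte area */
    if (diff < 0x9600)
      return (uint16_t) (diff + 0x80);
    /* remaining differences lie in 0x2E600-0x2E6FF: store as 0xE6xx */
    return (uint16_t) (diff - 0x2E600 + 0xE600);
  }

For example, GB+81308231 (difference 0x0B) gets key 0x8B, landing in the upper half of the first region; adding 0x80 is what keeps such keys clear of the 1-byte code points 0x00-0x7F.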
Special case: ligatures
Some characters in Unicode have upper cases that are ligature sequences; for example, the upper case of U+FB00 is 'FF' (U+0046 U+0046). That means one Unicode code point maps to a sequence of multiple Unicode code points. We do not deal with these ligatures, because our existing utf8/utf8mb4 do not deal with them either, and we would like to keep the behavior of these character sets the same.

Collation
---------
Currently, three collations are supported:
- gb18030_bin
- gb18030_chinese_ci
- gb18030_unicode_520_ci

gb18030_bin behaves the same as other bin collations (gbk/ujis/big5, etc.), while gb18030_unicode_520_ci behaves the same as the other unicode_520_ci collations. To make sure ligatures are sorted correctly, this collation can be chosen for sorting non-Chinese characters.

For gb18030_chinese_ci, which supports PINYIN, we have the following sort rules:
- For every non-Chinese character, the original sort key is GB(UPPER(ch)) if UPPER(ch) exists, otherwise GB(ch).
- Non-Chinese characters sort in the order of the original sort key, so: 0x01 < ... < 0x7F < 0xAEA1 < ... < 0xFEFE < 0x81308130 < ... < 0xE3329A35 < ... < 0xFE39FE39
- Chinese characters are sorted according to their PINYIN; the PINYIN collation we use is defined in zh.xml of CLDR 24. All Chinese characters sort after all non-Chinese characters, except GB+FE39FE39, which is the maximum code point.

Considering the encoding of GB18030, a 2-byte code point may compare bigger than a 4-byte code point under memcmp; for example GB+AEA1 would be bigger than GB+81308130. To conform to the above rules in strnxfrm, the sort key of every non-Chinese gb18030 code point is adjusted as follows:
- For a 1-byte or 2-byte sort key, the sort key is the original sort key
- For a 4-byte sort key, the sort key is the difference GB(sort key) - GB+81308130 (as defined in the mapping section) plus 0xFF000000; since there are nearly 1.6M 4-byte code points, every difference fits in 3 bytes
- The leading byte is then less than 0x80 for a 1-byte sort key, 0x81-0xFE for a 2-byte one and 0xFF for a 4-byte one, which makes the comparison correct
- For GB+FE39FE39, the weight shall be 0xFFFFFFFF
- Each Chinese character has a sequence number, which is > 0, according to the PINYIN collation; its weight is 0xFFA00000 plus the sequence number
So we have: non-Chinese characters except GB+FE39FE39, sorted according to UPPER() < Chinese characters, sorted according to PINYIN < GB+FE39FE39. (A sketch of the weight computation follows.)
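A sketch of the weight computation described above; pinyin_seq_no() and gb18030_diff() are hypothetical helpers standing for the CLDR-derived PINYIN table lookup and the 4-byte difference sketched earlier:

  #include <stdint.h>

  #define WEIGHT_4BYTE_BASE  0xFF000000u  /* non-Chinese 4-byte code points */
  #define WEIGHT_PINYIN_BASE 0xFFA00000u  /* Chinese characters, by PINYIN  */
  #define WEIGHT_MAX         0xFFFFFFFFu  /* GB+FE39FE39, always biggest    */

  /* Hypothetical: sequence number (> 0) in the CLDR 24 PINYIN order,
     0 if `gb` is not a Chinese character. */
  extern uint32_t pinyin_seq_no(uint32_t gb);

  /* Hypothetical: linear difference of a 4-byte code point from
     GB+81308130, as in gb18030_4byte_diff() above. */
  extern uint32_t gb18030_diff(uint32_t gb);

  /*
    Sketch: gb18030_chinese_ci weight of one (already upper-cased)
    gb18030 code point `gb` (0x00 .. 0xFE39FE39).
  */
  static uint32_t gb18030_weight(uint32_t gb)
  {
    uint32_t seq;

    if (gb == 0xFE39FE39)
      return WEIGHT_MAX;                /* sorts after everything else  */
    if ((seq = pinyin_seq_no(gb)) != 0)
      return WEIGHT_PINYIN_BASE + seq;  /* Chinese: PINYIN order        */
    if (gb <= 0xFFFF)
      return gb;                        /* 1- and 2-byte: code value    */
    return WEIGHT_4BYTE_BASE + gb18030_diff(gb); /* 0xFF + 3-byte diff  */
  }

Because 1-byte weights start with a byte below 0x80, 2-byte ones with 0x81-0xFE and 4-byte ones with 0xFF, a plain byte comparison of the resulting keys reproduces the required order.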
Regarding the PUA issue mentioned above, the PINYIN collation only covers the non-PUA characters.

Because a weight can be up to twice as long as the code point it replaces (for example, weight(GB+8140) is 4 bytes while the original code point is 2 bytes), the strnxfrm_multiplier shall be 2 for the ci collation; caseup_multiply and casedn_multiply shall be 2 as well.

It is unnecessary to support PINYIN and ligatures in the same collation: Chinese users care about PINYIN, while other users probably care only about ligatures.

mbcharlen
---------
In every character set except GB18030, the length of a multi-byte character can be determined from the leading byte. In GB18030, one byte is enough for 1-byte code points, but two bytes are needed for 2-byte and 4-byte code points. Two more APIs are therefore defined in m_ctype.h:
1. my_mbcharlen_2(s, a, b)
2. my_mbmaxlenlen(s)
The first gets the length from two bytes; the second returns the maximum number of leading bytes that may be needed to determine a character's length. For GB18030 that value is 2, while it is 1 for all other character sets. We get the length with the following steps:
- call my_mbcharlen(s, a)
- if the result is non-zero, we have the length
- if the result is 0 and my_mbmaxlenlen(s) is 2, it may be a GB18030 multi-byte character:
  - try to get the next byte; if that fails, treat the previous byte as a single character
  - call my_mbcharlen_2(s, a, b) to get the length; if the result is non-zero, we have what we want; if not, treat the previous byte as a single character
  - in some cases we must put the second byte back, to avoid affecting the original logic of the code
(A sketch of this sequence follows.)
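A minimal sketch of these steps for scanning one character from a buffer, using the m_ctype.h macros named above (truncation handling simplified; the function name is illustrative):

  #include <stddef.h>
  #include "m_ctype.h"  /* assumed: CHARSET_INFO, my_mbcharlen,
                           my_mbcharlen_2, my_mbmaxlenlen */

  /*
    Sketch: length in bytes of the character starting at `p`, where
    `end` marks the end of the buffer.  Falls back to treating the
    first byte as a single character, as the steps above describe.
  */
  static size_t char_length_at(const CHARSET_INFO *cs,
                               const unsigned char *p,
                               const unsigned char *end)
  {
    size_t len = my_mbcharlen(cs, p[0]);

    if (len != 0)
      return len;                    /* first byte was decisive */

    if (my_mbmaxlenlen(cs) == 2)     /* maybe a gb18030 mb character */
    {
      if (p + 1 >= end)
        return 1;                    /* no next byte: treat as single */
      len = my_mbcharlen_2(cs, p[0], p[1]);
      if (len != 0)
        return len;                  /* decided by two bytes (2 or 4) */
    }
    return 1;                        /* not a valid mb head: single byte */
  }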
References
----------
Collation chart for zh@collation=pinyin (ICU 4.4.2 - CLDR 1.8.1)
http://collation-charts.org/icu442/icu442-zh@collation=pinyin.html

GB 18030: A mega-codepage
http://icu-project.org/docs/papers/gb18030.html

GB18030
http://source.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html

Some collations (including PINYIN) on the Unicode website
http://www.unicode.org/review/pr-175/