WL#4024: gb18030 Chinese character set

Affects: Server-5.7 — Status: Complete

In 2000, mainland China introduced a new character set,
gb18030: "Chinese National Standard GB 18030-2000:
Information Technology -- Chinese ideograms coded character
set for information interchange -- Extension for the basic set".
The standard was revised in 2005, so the current version is
GB 18030-2005. It supersedes the two character sets that MySQL
already supports, gb2312 and gbk.

Since gb18030 is upward compatible with gb2312 and gbk,
and since the new characters in gb18030 are rare,
it has been possible to use gb2312 or gbk when the
true target is gb18030. But the Chinese government
doesn't approve of that, and it does cause problems
for users.

A prerequisite is: MySQL must support supplementary
Unicode characters as described in
WL#1213 Implement 4-byte UTF8, UTF16 and UTF32.
This was implemented in MySQL-6.0.

The MySQL character set name will be gb18030.
gb18030 will support UPPER/LOWER conversion
for all letters, including Latin extended characters
and non-Latin letters (Greek, Cyrillic, etc.).


Collation names will be:

- gb18030_bin

- gb18030_chinese_ci
  This collation will sort according to the UPPER map,
  i.e. using UPPER(letter) as the weight for each letter.



References
----------

Wikipedia article "GB 18030"
http://en.wikipedia.org/wiki/GB_18030

Functional requirements:

F-1: The charset shall be a compiled one.

F-2: GB18030 shall work in the same way as other charsets. It shall be
     available in SQL constructs such as 'SET NAMES', '_gb18030',
     'CHARACTER SET gb18030', etc. The same applies to the collations that
     GB18030 supports.

F-3: Conversion between GB18030 and other charsets (utf8, etc.) shall be
     supported, so statements like 'CONVERT (xxx USING gb18030)' shall work.

F-4: All the code points defined in the gb18030 standard shall be supported.
     All Unicode code points in [U+0000, U+10FFFF], excluding the surrogates
     [U+D800, U+E000), are mapped to the following GB18030 ranges, and
     vice versa:
     * The whole 1-byte range
     * The whole 2-byte range
     * The 4-byte range [GB+81308130, GB+E3329A35]
     The code points in the ranges (GB+8431A439, GB+90308130) and
     [GB+E3329A36, GB+FE39FE39] are not assigned in GB18030, but they shall
     be supported as well. They shall be treated as '?'(0x3F), and any
     conversion of them shall result in '?' too.

F-5: The private use area [U+E000, U+F8FF] in Unicode shall be mapped to
     gb18030 too. The gb18030 code points may be 2 or 4 bytes long. For
     example, U+E000 is mapped to GB+AAA1, and U+F8FF is mapped to
     GB+84308130.

F-6: There is no mapping between [U+D800, U+DFFF] (the surrogate area) and
     GB18030. An attempted conversion should raise an error, but for now
     it shall result in a '?', because the mysql client accepts
     _ucs2 0xD800 etc.

F-7: If an incoming GB18030 sequence is illegal, the charset shall raise
     an error or give a warning. If the illegal sequence is used in
     CONVERT, an error shall be raised; when it is used in INSERT etc., a
     warning shall be given.

F-8: UPPER/LOWER can be applied to every GB18030 code point. The charset
     shall support all the case folding defined by Unicode (please refer to
     the reference section of the HLD; all 1-1 mappings are defined in
     UnicodeData).

F-9: To stay consistent with utf8/utf8mb4, UPPER shall not be supported for
     ligatures. So although U+FB03 is a ligature, we still get:
     UPPER(gb18030(U+FB03))=gb18030(U+FB03).

F-10: A search for ligatures shall match their upper-case forms too. If we
      search for 'FFI', 'ffi' or CONCAT(U+FB00, 'I'), U+FB03 shall be a
      match, since the UPPER of all of them is 'FFI'. This is only supported
      in the gb18030_unicode_520_ci collation.

F-11: If a character has more than one upper-case form, we shall choose the
      one whose lower case is the character itself. The others may be
      ligatures and so on, which have no lower case.

F-12: The charset shall support both the gb18030_bin and
      gb18030_chinese_ci collations. gb18030_bin shall act like all
      other bin collations.

F-13: gb18030_chinese_ci shall sort all non-Chinese characters according
      to the UPPER map and the code value of the code points. The charset
      defines these comparison rules:
      - Convert every code point to UPPER before comparison
      - 4-byte code points are bigger than 2-byte code points
      - 2-byte code points are bigger than 1-byte code points
      - For example, 'A'='a', and
        0x61=0x41<0x42<0x7F<0xAEA1<0xFEFE<0x81308130<0x81308131, etc.

      Chinese characters are sorted according to the PINYIN collation
      defined in CLDR 24. All Chinese characters are bigger than all
      non-Chinese characters, except for GB+FE39FE39, which always has the
      biggest weight.

      So the sorting sequence shall be:
      Non-Chinese characters except GB+FE39FE39, sorted according to UPPER() <
      Chinese characters, sorted according to PINYIN <
      GB+FE39FE39.

F-14: The charset shall support comparison between two gb18030 strings, or
      between a gb18030 string and a string in another charset. If the two
      strings are in different charsets, there will be a conversion (either
      from gb18030 to the other charset or the other way around).
      Comparisons both ignoring and not ignoring trailing spaces shall be
      supported.

F-15: The maximum and minimum multi-byte lengths shall be 4 and 1.
      The charset shall determine the length of a sequence from its leading
      1 or 2 bytes. For example, 0x61 is determined to be one GB18030
      character from its first byte alone, whereas 0x81 alone is ambiguous;
      0x8138 (4 bytes) and 0x8150 (2 bytes) are decidable from two bytes.

Non-Functional requirements:

NF-1: Conversion and comparison between GB18030 and other charsets might be
      a little more expensive than for other charsets, because GB18030 is a
      multi-byte charset whose characters can be 1, 2 or 4 bytes long, which
      makes the implementation more complicated. This shall not reduce the
      server throughput.

Relations between GB18030-2000 and GB18030-2005
-----------------------------------------------
 - Their coding structures are the same; GB18030-2005 is a
   replacement (extension) of GB18030-2000
 - The characters of CJK Unified Ideographs Extension B are included in
   GB18030-2005
 - A few code points of GB18030-2000 are adjusted in GB18030-2005
 - GB18030-2000 is a mandatory standard; the extensions in GB18030-2005 (the
   two items above) are optional (recommended)

MySQL should therefore support GB18030-2005 (GB18030 for short).

Coding structure of GB18030
---------------------------
A multi-byte encoding using 1-byte, 2-byte and 4-byte codes.
 - Single-byte: 00-7F
 - Two-byte: 81-FE | 40-7E, 80-FE
 - Four-byte: 81-FE | 30-39 | 81-FE | 30-39
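
The byte ranges above can be checked mechanically. The following is a minimal
Python sketch, not the server implementation; the function name
gb18030_char_length is made up for illustration. It only tests the coding
structure, not whether a code point is assigned.

```python
def gb18030_char_length(buf: bytes) -> int:
    """Return 1, 2 or 4 if buf starts with a structurally valid GB18030
    sequence, or 0 if the leading bytes are illegal or truncated."""
    if not buf:
        return 0
    b1 = buf[0]
    if b1 <= 0x7F:                       # single-byte: 00-7F
        return 1
    if not 0x81 <= b1 <= 0xFE or len(buf) < 2:
        return 0
    b2 = buf[1]
    if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
        return 2                         # two-byte: 81-FE | 40-7E, 80-FE
    if 0x30 <= b2 <= 0x39 and len(buf) >= 4:
        # four-byte: 81-FE | 30-39 | 81-FE | 30-39
        if 0x81 <= buf[2] <= 0xFE and 0x30 <= buf[3] <= 0x39:
            return 4
    return 0
```

For example, the bytes 81 50 form a 2-byte code and 81 30 81 30 a 4-byte
code, while 81 7F does not start a legal sequence.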

Features and challenges
-----------------------
 - The length of a byte sequence cannot always be determined from its
   first byte; the second byte must be examined as well (a significant
   difference from other charsets)
 - The GB18030 code page is huge: there are more than 1.6 million valid
   byte sequences
 - It is similar to a UTF: all 1.1 million Unicode code points
   U+0000-U+10FFFF, except for the surrogates U+D800-U+DFFF, map to and
   from GB 18030 codes
 - Random access in GB 18030 text is even more difficult than in other
   multi-byte code pages
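
The figure of more than 1.6 million valid byte sequences follows directly
from the coding structure (single-byte 00-7F; two-byte 81-FE | 40-7E, 80-FE;
four-byte 81-FE | 30-39 | 81-FE | 30-39). A short Python calculation, with
variable names chosen only for this illustration:

```python
single = 0x7F - 0x00 + 1                          # 128 single-byte codes
lead = 0xFE - 0x81 + 1                            # 126 possible lead bytes
trail2 = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)    # 63 + 127 = 190 second bytes
digit = 0x39 - 0x30 + 1                           # 10 values for bytes 2 and 4

two_byte = lead * trail2                          # 23940
four_byte = lead * digit * lead * digit           # 1587600
total = single + two_byte + four_byte             # 1611668

print(single, two_byte, four_byte, total)
```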


Mapping table
-------------
Most of these mappings, except for parts of the BMP (U+0000 ~ U+FFFF), can be
done algorithmically:
 - Single-byte (128 code points): 00-7F are mapped to Unicode U+0000-U+007F
 - Two-byte (23940 code points) and four-byte codes less than 0x90308130
   (39420 code points) are mapped to Unicode U+0080-U+FFFF, except
   U+D800-U+DFFF
 - Four-byte: 0x90308130 ~ 0xE3329A35 are mapped to Unicode U+10000-U+10FFFF
The mapping table for Unicode code points U+0080-U+FFFF must be explicitly
defined, except for some large ranges, including:
 - U+0452-U+1E3E to GB+8130D330-GB+8135F436, 6637 code points
 - U+2643-U+2E80 to GB+8137A839-GB+8138FD38, 2110 code points
 - U+9FA6-U+D7FF to GB+82358F33-GB+8336C738, 14426 code points
 - U+E865-U+F92B to GB+8336D030-GB+84308534, 4295 code points
 - U+FA2A-U+FE2F to GB+84309C38-GB+84318537, 1030 code points
Other ranges, which cover fewer than 1000 code points, are ignored here.
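
Within each of the large ranges listed above the mapping is linear, so only
the range boundaries need to be stored. A minimal Python sketch of such a
range lookup follows; the table layout and the function names are
assumptions made for this example, only the range values come from the list
above.

```python
from typing import Optional

# (first Unicode, last Unicode, first GB18030 code) for the large ranges.
RANGES = [
    (0x0452, 0x1E3E, "8130D330"),
    (0x2643, 0x2E80, "8137A839"),
    (0x9FA6, 0xD7FF, "82358F33"),
    (0xE865, 0xF92B, "8336D030"),
    (0xFA2A, 0xFE2F, "84309C38"),
]


def gb4_to_index(code: bytes) -> int:
    """Linear distance of a 4-byte code from GB+81308130."""
    b1, b2, b3, b4 = code
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def index_to_gb4(index: int) -> bytes:
    """Inverse of gb4_to_index."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


def bmp_range_to_gb18030(cp: int) -> Optional[bytes]:
    """Return the 4-byte code if cp lies in one of the linear ranges,
    or None if the explicit mapping table would be needed."""
    for first, last, gb_first in RANGES:
        if first <= cp <= last:
            base = gb4_to_index(bytes.fromhex(gb_first))
            return index_to_gb4(base + (cp - first))
    return None
```

The last code point of each range reproduces the upper GB18030 bound given
in the list, for example U+D7FF yields GB+8336C738.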

Mapping from GB18030 to Unicode
 - Use the mapping table, except for the ranges listed above
 - For two-byte GB18030 code points, the mapping table goes from the GB
   encoding to the Unicode encoding
 - For four-byte GB18030 code points, the mapping table goes from the
   difference between the code point and GB+81308130 to Unicode
 - Example: GB+81308131 is mapped to U+0081, so the table entry is
   0x01 -> U+0081; GB+81308230 is mapped to U+008A, so the table entry is
   0x0A -> U+008A, etc.
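
The "difference" used as the table index is the linear distance of the
4-byte code from GB+81308130, counting bytes 2 and 4 in base 10 and byte 3
in base 126. A small Python sketch (the function name is illustrative only):

```python
def gb4_difference(code: bytes) -> int:
    """Linear difference between a 4-byte GB18030 code and GB+81308130."""
    b1, b2, b3, b4 = code
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


print(hex(gb4_difference(bytes.fromhex("81308131"))))   # index for U+0081
print(hex(gb4_difference(bytes.fromhex("81308230"))))   # index for U+008A
```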

Mapping from Unicode to GB18030
 - Each Unicode code point in the table is mapped to a 2-byte value, which
   is one of the following:
   1. A two-byte GB18030 code point, whose leading byte is greater than or
      equal to 0x81
   2. A two-byte difference, whose leading byte is less than 0x81
      2.1 For Unicode code points up to U+D7FF, the difference is between
          the GB code point value and GB+81308130. In fact, the range
          U+9FA6 to U+D7FF can be mapped algorithmically
      2.2 For Unicode code points from U+E000 to U+FFFF, the difference is
          the difference between the GB code point value and GB+81308130,
          minus 7456. The result is guaranteed to be positive, and the
          subtraction ensures that its leading byte is less than 0x81
The range [U+D800, U+DFFF] is an exception: it is reserved for internal use
by UTF-16. GB18030 has no mappings for these code points and reports an
illegal sequence if the input is in this range.

The unused 4-byte code points that have no corresponding code points in
Unicode, (GB+8431A439, GB+90308130) and (GB+E3329A35, GB+FE39FE39), are
currently supported too. These characters are treated as '?'(0x3F).


Unicode PUA
-----------
There is a full mapping between gb18030 and Unicode, so all PUA code points
are mapped to gb18030 as well, and vice versa.

gbk maps 2149 code points to the Unicode PUA [U+E000, U+E864]. 80 of them are
characters that Unicode had not defined at the time, while the others are
unassigned code points in gbk. gb18030 inherits the mappings for these 80
characters. Later Unicode versions moved these 80 characters to non-PUA code
points, so the non-PUA characters can be used instead of the PUA ones.
gb18030 also includes the mappings between these characters and their
non-PUA code points.

For example:
GB+FE64 and GB+8336CA39 are different gb18030 code points. The former is
mapped to U+3A73, while the latter is mapped to U+E829, which is in the PUA
and has no definition in Unicode. If users need to treat them as equal, that
can be done in a user-defined collation or with a conversion of some kind.

That means our gb18030 supports the mappings between gb18030 code points and
Unicode code points, rather than the mappings between characters and gb18030
code points. In fact, gb18030 defines a full mapping between gb18030 code
points and the PUA [U+E000, U+F8FF], but most of them are unassigned code
points in gb18030.


Unicase
-------
Since GB18030 supports all code points in Unicode, all letters that have an
UPPER/LOWER form must be taken into account. All possible code points for
these letters are collected in the CaseFolding file published by Unicode.

Since GB18030 has a variable-length encoding of up to 4 bytes, it is nearly
impossible to use a mapping array to store the {UPPER, LOWER, SORT} tuples
for all case-sensitive code points directly, as we do for gbk, utf-8, etc.
But we can do it the way we do for the mapping tables between gb18030 and
Unicode:

 - {UPPER, LOWER, SORT} is used for all the code points; UPPER/LOWER are
   encoded as gb18030 and SORT is encoded as Unicode
 - For every UPPER or LOWER, if it is
   * 1-byte (0x00 - 0x7F), keep it as is (or we can use the
     to_lower/to_upper arrays)
   * 2-byte (0xA2xx, 0xA3xx, 0xA6xx, 0xA7xx, 0xA8xx), keep it as is
   * 4-byte, calculate the difference between the code point encoding and
     GB+81308130. For GB+81308231, the difference is 1*10+1=11=0x0B.
     Nearly all differences are less than 0x9600; the only exception is
     that some differences lie between 0x2E600 and 0x2E6FF.
     These regions do not conflict with the 2-byte area, but they would
     conflict with the 1-byte area, because the difference for some LOWER
     values is less than 0x0080. So 0x80 is added to the differences that
     are less than 0x9600 to avoid the conflict, and the region from
     0x2E600 to 0x2E6FF is treated as the region from 0xE600 to 0xE6FF.
 - If a tuple is defined, such as {0xA2FB, 0x20B0, 0x216A}, all the code
   points with the same leading byte (0xA2 in this case) should be defined
   too, so the region will be 0xA200-0xA2FF
 - All the possible regions (21 in total) will be
        /* actual code points for the 1-byte gb18030 encoding */
        [0x0000, 0x00FF]
        ----------------
        /* differences between the GB+ encoding and GB+81308130,
           with the addition of 0x80 */
        [0x0100, 0x01FF]
        [0x0200, 0x02FF]
        [0x0300, 0x03FF]
        [0x0400, 0x04FF]
        [0x1000, 0x10FF]
        [0x1D00, 0x1DFF]
        [0x1E00, 0x1EFF]
        [0x1F00, 0x1FFF]
        [0x2000, 0x20FF]
        [0x2300, 0x23FF]
        [0x2A00, 0x2AFF]
        [0x2B00, 0x2BFF]
        [0x5100, 0x51FF]
        [0x5200, 0x52FF]
        ----------------
        /* actual code points for the 2-byte gb18030 encoding */
        [0xA200, 0xA2FF]
        [0xA300, 0xA3FF]
        [0xA600, 0xA6FF]
        [0xA700, 0xA7FF]
        [0xA800, 0xA8FF]
        ----------------
        /* 0x2E600-0x2E6FF are the true differences for the 4-byte
           encoding */
        [0xE600, 0xE6FF]
 - Finally, an array of length 256 can be defined for MY_UNICASE_INFO for
   gb18030
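
The index computation described in this list can be sketched in Python as
follows. This is only an illustration of the arithmetic; unicase_index and
unicase_page are hypothetical names, not functions of the server, and the
bounds 0x2E600 and 0x2E6FF are treated as inclusive here.

```python
def unicase_index(code: bytes) -> int:
    """Map a GB18030 code (1, 2 or 4 bytes) to a 16-bit index whose high
    byte selects one of the 256 pages and whose low byte the slot."""
    if len(code) in (1, 2):
        return int.from_bytes(code, "big")      # kept as is
    b1, b2, b3, b4 = code
    diff = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    if diff < 0x9600:
        return diff + 0x80                      # avoid the 1-byte area
    if 0x2E600 <= diff <= 0x2E6FF:
        return diff - 0x20000                   # folded into 0xE600-0xE6FF
    raise ValueError("no case information is stored for this code point")


def unicase_page(code: bytes) -> int:
    """Page number (0-255) within the 256-entry array."""
    return unicase_index(code) >> 8
```

For GB+81308231 the difference is 0x0B, so the index is 0x8B on page 0x00,
which does not collide with the 1-byte codes 0x00-0x7F on the same page.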

Special case: ligatures

Some characters in Unicode are ligatures whose upper case is a sequence of
characters, such as U+FB00, whose upper case is 'FF' (U+0046 U+0046). That
means one Unicode code point maps to a sequence of multiple Unicode code
points. Currently, we don't handle these ligatures, because our existing
utf8/utf8mb4 charsets don't handle them either. We would like to keep the
behavior of these charsets the same.


Collation
---------
Currently, three collations are supported:
 - gb18030_bin
 - gb18030_chinese_ci
 - gb18030_unicode_520_ci

gb18030_bin behaves the same as other bin collations, such as those of
gbk/ujis/big5, etc., while gb18030_unicode_520_ci behaves the same as any
other unicode_520_ci collation. To make sure that ligatures are sorted
correctly, we can choose this collation to sort non-Chinese characters.

For gb18030_chinese_ci, which supports PINYIN, we have the following sort
rules:
 - For all non-Chinese characters, the original sort key is GB(UPPER(ch))
   if UPPER(ch) exists, otherwise GB(ch).
 - For all non-Chinese characters, the sort is based on the order of the
   original sort key. So:
   0x01 <...< 0x7F < 0xAEA1 <...< 0xFEFE < 0x81308130 <...< 0xE3329A35 <
   ...< 0xFE39FE39
 - All Chinese characters are sorted according to their PINYIN; the PINYIN
   collation we use is defined in zh.xml of CLDR 24. All Chinese characters
   are sorted after all non-Chinese characters, except GB+FE39FE39, which
   is the max code point.

Considering the encoding of GB18030, a 2-byte code point may be bigger than
a 4-byte code point when they are compared by memcmp; for example, GB+AEA1
is bigger than GB+81308130. To conform to the above rules in strnxfrm, the
sort key of every non-Chinese GB18030 code point should be adjusted as
follows:
 - For a 1-byte or 2-byte sort key, the sort key is the original sort key
 - For a 4-byte sort key, the sort key is the difference
   GB(sort key) - GB+81308130, plus 0xFF000000. Because there are nearly
   1.6M code points in the 4-byte encoding, all the differences fit in
   3 bytes.
 - The leading byte is less than 0x80 for a 1-byte sort key, 0x81-0xFE for
   a 2-byte one and 0xFF for a 4-byte one, which makes the comparison
   correct.
 - For GB+FE39FE39, the weight shall be 0xFFFFFFFF.

Each Chinese character, in turn, has a sequence number greater than 0
according to the PINYIN collation, and its weight is 0xFFA00000
plus that sequence number. So we have:
  Non-Chinese characters except GB+FE39FE39, sorted according to UPPER() <
  Chinese characters, sorted according to PINYIN < GB+FE39FE39.
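
The weight rules can be summarized in a short Python sketch. It assumes the
caller has already applied UPPER() and, for Chinese characters, looked up
the PINYIN sequence number; sort_weight and its pinyin_seq parameter are
hypothetical and only illustrate the arithmetic. Python compares bytes
objects the way memcmp does, so the ordering can be checked directly.

```python
from typing import Optional

MAX_CODE = bytes.fromhex("FE39FE39")


def sort_weight(upper_code: bytes, pinyin_seq: Optional[int] = None) -> bytes:
    """Weight of one code point for the chinese_ci rules.

    upper_code -- GB18030 bytes of UPPER(ch), or of ch if it has no UPPER
    pinyin_seq -- PINYIN sequence number (> 0) for Chinese characters,
                  None for all other characters
    """
    if pinyin_seq is not None:
        return (0xFFA00000 + pinyin_seq).to_bytes(4, "big")
    if upper_code == MAX_CODE:
        return b"\xff\xff\xff\xff"
    if len(upper_code) < 4:
        return upper_code                       # 1-byte and 2-byte keys
    b1, b2, b3, b4 = upper_code
    diff = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    return (0xFF000000 + diff).to_bytes(4, "big")


weights = [
    sort_weight(b"A"),
    sort_weight(bytes.fromhex("AEA1")),
    sort_weight(bytes.fromhex("FEFE")),
    sort_weight(bytes.fromhex("81308130")),
    sort_weight(bytes.fromhex("81308131")),
    sort_weight(bytes.fromhex("B0A1"), pinyin_seq=1),  # seq number is made up
    sort_weight(MAX_CODE),
]
print(weights == sorted(weights))
```

The largest possible 4-byte difference is below 0xA00000, which is why the
Chinese weights starting at 0xFFA00000 always sort after the non-Chinese
4-byte weights.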

As for the PUA issue mentioned above, the PINYIN collation only supports the
non-PUA characters.

Because a weight may be twice as long as its code point (for example, the
weight of GB+8140 is 4 bytes while the original key is 2 bytes),
strnxfrm_multiplier shall be 2 for the ci collation. The same applies to
caseup_multiply, and casedn_multiply shall be 2 as well.

It is unnecessary to support PINYIN and ligatures in the same collation,
because Chinese users would care about PINYIN only, while other users
probably care about ligatures only.


mbcharlen
---------
The length of a multi-byte character can be determined from the leading byte
in all character sets except GB18030. In GB18030, the length can be
determined from 1 byte for 1-byte code points, but 2 bytes are needed for
2-byte and 4-byte code points.

Two more APIs for mbcharlen are defined in m_ctype.h:
1. my_mbcharlen_2(s, a, b)
2. my_mbmaxlenlen(s)

The first one gets the length from two bytes, and the second one returns
a value indicating the maximum number of bytes needed to determine the
multi-byte length. For GB18030 this value is 2, while it is 1 for other
character sets.

We can get the length with the following steps:
 - call my_mbcharlen(s, a)
 - if the result is non-zero, we have the length
 - if the result is 0 and my_mbmaxlenlen(s) is 2, it may be a GB18030
   multi-byte character
 - try to get the next byte; if that fails, treat the previous byte as a
   single character
 - call my_mbcharlen_2(s, a, b) to get the length; if the result is
   non-zero, we have the length, otherwise treat the previous byte as a
   single character
 - in some cases, we have to put the second byte back to avoid affecting
   the original logic of the code
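
The steps above can be sketched in Python as follows. The functions
mbcharlen and mbcharlen_2 and the constant MBMAXLENLEN are simplified
stand-ins written for this example; they are not the m_ctype.h macros and
do not take a charset argument.

```python
MBMAXLENLEN = 2  # stand-in for my_mbmaxlenlen(): 2 for GB18030


def mbcharlen(a: int) -> int:
    """Length from one byte: 1 for 00-7F, 0 if one byte is not enough."""
    return 1 if a <= 0x7F else 0


def mbcharlen_2(a: int, b: int) -> int:
    """Length from two bytes: 2, 4, or 0 if the pair is illegal."""
    if not 0x81 <= a <= 0xFE:
        return 0
    if 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE:
        return 2
    if 0x30 <= b <= 0x39:
        return 4
    return 0


def split_chars(data: bytes) -> list:
    """Split data into characters following the steps in the text."""
    out = []
    i = 0
    while i < len(data):
        n = mbcharlen(data[i])
        if n == 0 and MBMAXLENLEN == 2 and i + 1 < len(data):
            # Peek at the next byte; it is "put back" simply by not
            # advancing i when the pair turns out to be illegal.
            n = mbcharlen_2(data[i], data[i + 1])
        if n == 0 or i + n > len(data):
            n = 1                     # treat the byte as a single character
        out.append(data[i:i + n])
        i += n
    return out
```

Note that, like the two-byte lookup it mimics, this sketch does not validate
the third and fourth bytes of a 4-byte character.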


References
----------

Collation chart for zh@collation=pinyin (ICU 4.4.2 - CLDR 1.8.1)
http://collation-charts.org/icu442/icu442-zh@collation=pinyin.html

GB 18030: A mega-codepage 
http://icu-project.org/docs/papers/gb18030.html

GB18030
http://source.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html

Some collations (including PINYIN) on Unicode website
http://www.unicode.org/review/pr-175/