WL#11825: Add Chinese collation for utf8mb4

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

This worklog tracks the work of adding a Chinese collation for utf8mb4. The
collation will be named 'utf8mb4_zh_0900_as_cs': 'utf8mb4' is the character
set, 'zh' is the ISO code for Chinese, '0900' means the collation follows
Unicode standard 9.0, and 'as_cs' means accent sensitive and case sensitive.

F-1: The collation shall be a compiled one.

F-2: The collation shall work for all characters in range [U+0, U+10FFFF].

F-3: The collation shall sort characters of the language correctly according to
     language specific rules.

F-4: The collation shall sort characters not belonging to the language according
     to their order in DUCET.

F-5: For the characters whose weight is not assigned in DUCET, the collation
     shall sort them with their implicit weight value which is constructed in
     the UCA way.

F-6: The collation shall be case / accent sensitive.

We'll follow the rules defined by the CLDR collation definition file for
Chinese, zh.xml, to implement this collation. The zh.xml file defines the
reordering rule for all character groups, the sorting order of 41294 Han
characters, and a few weight shifting rules for Bopomofo (symbols used to
represent the pronunciation of Han characters) and some special characters.
Basically, we'll re-use the code we implemented for other utf8mb4 collations,
but there are a few things we need to tune to make it work correctly for
Chinese.


The CLDR defines the rule "[reorder Hani]" for the Chinese collation. This rule
means that all Han characters should sort before other characters (except for
core characters such as spaces and symbols). The DUCET orders the character
groups as [core group, Latin, Cyrillic ..., Hani, others]. Let us give the
name GrpA to the character groups between the core group and the Hani group.
After reordering, the character groups should be [core group, Hani, GrpA,
others].

We have implemented reordering with WL#9108 and WL#9751. For the character
groups (except for Hani) which are involved in reordering, we calculate the
groups' reordering parameters, and apply the reordering when returning a
weight. We'll do the same for the characters in GrpA (see item 4 below).

For the Hani group, the number of characters is huge and there is no simple
calculation that gives a character's reordered weight from its weight in the
DUCET, so to save CPU cost we pre-calculate the Han characters' weights with
uca9dump.

To implement this reordering, we change the primary weights of the characters
as follows:
1. For the core group, we don't change the weights. It still sorts before
   all other character groups.

2. The zh.xml file defines the sort order of 41294 Han characters (let us call
   them Han group1). In the DUCET these characters have implicit weights, each
   made of two primary weights. We can re-use the weight range freed by moving
   GrpA (see item 4 below) to give each character in Han group1 a single
   primary weight. We'll give them weights from 0x1C47 to 0xBD94 (0xBD94 -
   0x1C47 + 1 = 41294).

3. For other Han characters (let us call them Han group2) whose sorting order
   is not defined in zh.xml, we still give them two primary weights. The
   weights will be calculated the same way the UCA calculates implicit weights,
   but the leading weight is changed from 0xFB40 - 0xFB85 to 0xBD95 - 0xBD99.
   This won't cause any weight conflict because the leading weight (0xBD95
   - 0xBD99) is bigger than the largest weight of Han group1 (0xBD94). Their
   relative order is the same as in the DUCET.

4. For the characters in GrpA, which are assigned primary weights by the
   DUCET in the range [0x1C47, 0x54A3], we need to make them sort after
   all Han characters. To do this, we will give them a new weight range
   [0xBD9A, 0xF5F6] (0xF5F6 - 0xBD9A = 0x54A3 - 0x1C47). This makes them sort
   after all Han characters in both Han group1 and Han group2, because the
   smallest weight in this group (0xBD9A) is greater than both the biggest
   weight of Han group1 (0xBD94) and the biggest leading weight of Han group2
   (0xBD99).

5. For all other characters not mentioned above, we give them two primary
   weights. The weights will be calculated the same way the UCA calculates
   implicit weights, but the leading weight is changed from 0xFBC0 - 0xFBE1 to
   0xF5F7 - 0xF618.

The primary weight changes look like:

  Char Group       | Original Weight Range       | Reordered Weight Range
  core group       | 0200 - 1C46                 | 0200 - 1C46
  Han group1       | [FB40, AAAA] - [FB85, BBBB] | 1C47 - BD94
  Han group2       | [FB40, CCCC] - [FB85, DDDD] | [BD95, CCCC] - [BD99, DDDD]
  GrpA             | 1C47 - 54A3                 | BD9A - F5F6
  Others           | [FBC0, XXXX] - [FBE1, YYYY] | [F5F7, XXXX] - [F618, YYYY]
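
The range arithmetic in the table above can be checked with a short script.
This is only a minimal Python sketch, not the server implementation: the
helper names (han_group1_weight, reorder_grpa_weight, ranges_are_ascending)
and the notion of a zero-based rank within Han group1 are assumptions made
for illustration. Only the hexadecimal boundaries come from this worklog.

```python
# Boundaries copied from the table above; all names are illustrative.
CORE = (0x0200, 0x1C46)
HAN_GROUP1 = (0x1C47, 0xBD94)
HAN_GROUP2_LEAD = (0xBD95, 0xBD99)   # leading weights only
GRPA_ORIGINAL = (0x1C47, 0x54A3)
GRPA_REORDERED = (0xBD9A, 0xF5F6)
OTHERS_LEAD = (0xF5F7, 0xF618)       # leading weights only

HAN_GROUP1_COUNT = 41294


def han_group1_weight(rank):
    """Single primary weight for the Han character at zero-based position
    `rank` in the zh.xml order (hypothetical helper)."""
    if not 0 <= rank < HAN_GROUP1_COUNT:
        raise ValueError("rank outside Han group1")
    return HAN_GROUP1[0] + rank


def reorder_grpa_weight(weight):
    """Shift a DUCET primary weight of GrpA into its reordered range."""
    low, high = GRPA_ORIGINAL
    if not low <= weight <= high:
        raise ValueError("weight outside GrpA")
    return weight + (GRPA_REORDERED[0] - low)


def ranges_are_ascending(ranges):
    """True if every range ends before the next one starts."""
    return all(a[1] < b[0] for a, b in zip(ranges, ranges[1:]))
```

With these definitions, the last Han group1 rank (41293) lands on 0xBD94, the
GrpA endpoints 0x1C47 and 0x54A3 map to 0xBD9A and 0xF5F6, and the five
reordered ranges are strictly ascending, so none of them overlap.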

How the UCA calculates implicit weight and how we change it

We mentioned above that we'll calculate the weights of characters in Han
group2 and Others. Let us look at how the UCA calculates implicit weights and
how we change that. Taking a character in the Others group as an example,
the UCA calculates its implicit weight as:
  AAAA = 0xFBC0 + (CP >> 15)      // AAAA is the leading weight
  BBBB = (CP & 0x7FFF) | 0x8000   // BBBB is the second weight
CP is the character's code point. The only thing we need to change is to
replace 0xFBC0 with 0xF5F7.
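
The calculation for the Others group can be written as a small function. This
is a minimal Python sketch of the two formulas above with the base passed in
as a parameter. The names implicit_weight, UCA_OTHERS_BASE and ZH_OTHERS_BASE
are hypothetical and are not taken from the server code.

```python
UCA_OTHERS_BASE = 0xFBC0  # leading-weight base used by the UCA for Others
ZH_OTHERS_BASE = 0xF5F7   # replacement base described in this worklog


def implicit_weight(cp, base):
    """Return (leading weight, second weight) for code point `cp`."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    leading = base + (cp >> 15)
    second = (cp & 0x7FFF) | 0x8000
    return leading, second
```

For the largest code point, U+10FFFF, CP >> 15 is 0x21, so the leading weight
is 0xFBE1 with the UCA base and 0xF618 with the new base. This matches the
ranges 0xFBC0 - 0xFBE1 and 0xF5F7 - 0xF618 given earlier. The second weight is
the same in both cases.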

Weight shifting

As for other collations, when there is a rule like '&A < B' we shift the
weight by adding an extra collation element for B. For other collations, we
set the primary weight of the extra collation element to 0x54A4, the biggest
primary weight of regular characters plus one, to make sure there is no weight
overlap. To achieve the same goal here, we need to set the primary weight of
the extra collation element to 0xF619, which is immediately after the biggest
leading primary weight of the Others group (0xF618). The following example
shows how it works.

Assume we have a rule '&A < B' and the character A's weight is BD9A; then
we'll give B the weight BD9A F619. If we are comparing the strings 'AC' and
'BD', then
1. if C and D are in the core group, Han group1 or GrpA, the comparison result
   won't be affected by C's and D's weights, because F619 is greater than any
   weight of these character groups.
2. if C or D is in Han group2 or Others, where a character's second weight
   might be greater than F619, the comparison result still won't be affected,
   because F619 is greater than any leading primary weight, and we compare
   strings by comparing their weights starting from the first byte.
   Ex, AC's weight is: BD9A F5F8 F6FF
       BD's weight is: BD9A F619 1C50
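
The example can be replayed by treating each string's primary weights as a
sequence and comparing the sequences element by element. This is a minimal
Python sketch under that assumption. It ignores secondary and tertiary
weights, and the helper name shifted_weight is hypothetical.

```python
EXTRA_CE_PRIMARY = 0xF619  # primary weight of the extra collation element


def shifted_weight(anchor_weights):
    """Weights for B under a rule '&A < B': A's weights followed by the
    extra collation element."""
    return list(anchor_weights) + [EXTRA_CE_PRIMARY]


a = [0xBD9A]              # A, a character in GrpA
b = shifted_weight(a)     # B sorts right after A
c = [0xF5F8, 0xF6FF]      # C, a character in Others (two primary weights)
d = [0x1C50]              # D, a character in Han group1

ac = a + c                # BD9A F5F8 F6FF
bd = b + d                # BD9A F619 1C50

# Python compares lists lexicographically, which mirrors comparing the
# weight strings from the first weight onward.
ac_sorts_first = ac < bd
```

The comparison is decided at the second weight (F5F8 against F619), so the
larger second weight of C (F6FF) never comes into play and 'AC' sorts before
'BD'.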