WL#9125: Add utf8mb4_800_ci_ai

Affects: Server-8.0   —   Status: Complete

We have received many requests from the market to replace the current default
character set, latin1, with utf8mb4 to support all characters in BMP / SMP.
Although we already have the utf8mb4 character set and some collations based on
it, all these collations are case insensitive. To support accent / case
sensitive sorting, and later even punctuation-aware sorting to make sorting
Japanese possible, we have decided to add new collations which implement the
latest UCA (Unicode Collation Algorithm), version 8.0.0.

This WL tracks the implementation of a new general collation for utf8mb4, which
sorts all characters in the default order.

The new general collation's name will be: utf8mb4_800_ci_ai. "800" means the
implementation is based on UCA 8.0.0, "ci" means case insensitive, and "ai"
means accent insensitive.


F-1: The collation shall be a compiled one.

F-2: The collation shall work for all characters in range [U+0, U+10FFFF].

F-3: For the characters whose code point is assigned in DUCET (Default Unicode
     Collation Element Table), the collation shall sort them according to the
     weight value assigned.

F-4: For the characters whose code point is not assigned in DUCET, the collation
     shall sort them with their implicit weight value, which is constructed as
     specified by UCA.

F-5: The collation shall be case / accent insensitive.

F-6: To simplify the collation, characters in a contraction sequence will be
     treated as separate characters.
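
The implicit weight derivation required by F-4 can be sketched as follows. This
is a minimal illustration, assuming the formula given in UTS #10 for UCA 8.0.0:
a code point without a DUCET entry gets two collation elements,
[.AAAA.0020.0002][.BBBB.0000.0000], where AAAA = base + (CP >> 15) and
BBBB = (CP & 0x7FFF) | 0x8000. The base is FB40 or FB80 for unified ideographs
and FBC0 for any other code point. The function name and the "base" parameter
are hypothetical, and the caller is assumed to pick the base; this is not the
server code.

```python
def implicit_weights(code_point, base=0xFBC0):
    """Return the two implicit collation elements for a code point.

    Each element is a (primary, secondary, tertiary) tuple. The default
    base 0xFBC0 is the one UTS #10 assigns to code points that are not
    unified ideographs; pass 0xFB40 or 0xFB80 for ideographs.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    aaaa = base + (code_point >> 15)          # high bits go into the first CE
    bbbb = (code_point & 0x7FFF) | 0x8000     # low 15 bits, top bit forced on
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


# Largest code point covered by F-2, treated as a non-ideograph.
print([tuple(f"{w:04X}" for w in ce) for ce in implicit_weights(0x10FFFF)])
```

Because BBBB always has its top bit set, it is never zero, so the second
collation element is never ignorable at the primary level.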

Most of the sorting of this collation will be based on the data from DUCET.
DUCET defines weight data for about 30k characters. One character may have one
or more collation elements, and each collation element contains three unsigned
16-bit integer values. They are called the level 1 (or primary), level 2 (or
secondary), and level 3 (or tertiary) weights. The primary weight is for sorting
the base character, the secondary weight is for accent sensitive sorting, and
the tertiary weight is for case sensitive sorting. For example:
0061  ; [.1BC2.0020.0002] # LATIN SMALL LETTER A
0041  ; [.1BC2.0020.0008] # LATIN CAPITAL LETTER A
With a case insensitive collation, we shall sort "a" and "A" as equal, because
their primary weights are the same and they differ only in the tertiary weight.
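
To make the entry format concrete, here is a small sketch that parses the two
DUCET lines quoted above and compares them at the primary level. The function
names and the regular expression are assumptions made for illustration only;
they handle the simple "code points ; collation elements # comment" shape shown
here and are not taken from the server source.

```python
import re

# One collation element: "[.PPPP.SSSS.TTTT]" ("*" instead of the first "."
# marks a variable element in DUCET).
CE_PATTERN = re.compile(r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")


def parse_ducet_line(line):
    """Parse one DUCET entry into (code_points, collation_elements)."""
    entry = line.split("#", 1)[0]              # drop the trailing comment
    chars, _, elements = entry.partition(";")
    code_points = tuple(int(cp, 16) for cp in chars.split())
    ces = [tuple(int(w, 16) for w in m.groups())
           for m in CE_PATTERN.finditer(elements)]
    return code_points, ces


def primary_weights(ces):
    """Keep only the level 1 weight of each collation element."""
    return [ce[0] for ce in ces]


small_a = parse_ducet_line("0061  ; [.1BC2.0020.0002] # LATIN SMALL LETTER A")
capital_a = parse_ducet_line("0041  ; [.1BC2.0020.0008] # LATIN CAPITAL LETTER A")

print(primary_weights(small_a[1]) == primary_weights(capital_a[1]))  # True
print(small_a[1] == capital_a[1])                                    # False
```

The first comparison looks only at level 1 and finds "a" and "A" equal; the
second looks at all three levels and sees the tertiary difference.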

How we import DUCET:
We will import the whole DUCET: not only the primary weights, but also the
secondary and tertiary weights, because we are going to use them to implement
accent / case sensitive collations later.
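
The benefit of importing all three levels can be illustrated with a small sort
key sketch. It assumes the UCA convention of appending all non-zero weights of
one level, then a 0000 separator, then the next level. The table holds only the
two entries quoted in this worklog, and the names WEIGHTS and sort_key are
hypothetical; this does not describe how the server actually compares strings.

```python
# Full three-level weights for the two DUCET entries quoted in this worklog.
WEIGHTS = {
    "a": [(0x1BC2, 0x0020, 0x0002)],
    "A": [(0x1BC2, 0x0020, 0x0008)],
}


def sort_key(text, levels):
    """Build a UCA-style sort key from the first `levels` weight levels."""
    key = []
    for level in range(levels):
        if level > 0:
            key.append(0x0000)                 # separator between levels
        for ch in text:
            for ce in WEIGHTS[ch]:
                if ce[level] != 0:             # zero means ignorable here
                    key.append(ce[level])
    return key


# One level: accent and case insensitive, "a" and "A" compare equal.
print(sort_key("a", 1) == sort_key("A", 1))    # True
# Three levels: case sensitive, "a" sorts before "A".
print(sort_key("a", 3) < sort_key("A", 3))     # True
```

With the same imported table, a collation only has to choose how many levels it
compares, which is why the secondary and tertiary weights are kept even though
this collation needs only the primary ones.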

How we use DUCET:
We already implemented UCA 5.2.0 in the past, and the algorithm has not changed
since then, so we can re-use the source code. But because the UCA 5.2.0
collations use only the primary weight (which is enough for accent / case
insensitive sorting), we need to change the code to make it work with
multilevel weights, and
let the code decide how many levels of weight it should use according to the