WL#9751: Add Japanese collation to utf8mb4

Affects: Server-8.0 — Status: Complete

We already have Japanese collations for the SJIS and UJIS character sets, but
there is no Japanese collation for UTF8. This worklog tracks the implementation
of a Japanese collation for utf8mb4, based on the utf8mb4 collations built on
Unicode 9.0.

Bugs such as BUG#76553 show that our current collations cannot compare
Japanese characters correctly.

The implementation follows the rules defined for Japanese in CLDR v30.

User Documentation
==================

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-1.html

F-1: The collation shall be a compiled one.

F-2: The collation shall work for all characters in range [U+0, U+10FFFF].

F-3: The collation shall sort characters of the Japanese language correctly
     according to the language-specific rules defined by CLDR.

F-4: The collation shall sort characters not belonging to the language
     according to their order in DUCET.

F-5: For characters whose weight is not assigned in DUCET, the collation
     shall sort them by their implicit weight, constructed as specified by
     UCA.

F-6: The collation shall be case / accent sensitive.

NF-1: The performance of this collation's functions shall regress by no more
      than 10% compared to utf8mb4's general case- and accent-sensitive
      collation (utf8mb4_0900_as_cs).

CLDR defines the rule "[strength 3]" for Japanese collation, which means that
by default three levels of weight are used to implement this collation. We'll
call the new collation "utf8mb4_ja_0900_as_cs": "utf8mb4" is the character
set, "ja" is the language (Japanese), "0900" is the UCA version (9.0.0), and
"as_cs" means we use the secondary and tertiary weights defined in DUCET.

1. Hiragana and Katakana
  DUCET already assigns weights to Hiragana and Katakana that sort them
  correctly. It uses the level 2 weight to differentiate "voiced kana" from
  "voiceless kana", so the "mother-father" problem (BUG#76553) is solved.
  Example:
  306F  ; [.3D74.0020.000E] # HIRAGANA LETTER HA
  3070  ; [.3D74.0020.000E][.0000.0037.0002] # HIRAGANA LETTER BA
  "HA" is a "voiceless kana" and "BA" is a "voiced kana". Both have the same
  primary weight, 3D74, but their secondary weights differ: 0020 for "HA" and
  0020 0037 for "BA".

  The DUCET weights of Hiragana and Katakana also already guarantee that
  Hiragana sorts before Katakana. They are distinguished by the level 3 weight.
  Example:
  306F  ; [.3D74.0020.000E] # HIRAGANA LETTER HA
  30CF  ; [.3D74.0020.0011] # KATAKANA LETTER HA
  Hiragana "HA" and Katakana "HA" have the same primary weight, 3D74, and the
  same secondary weight, 0020. Their tertiary weights differ: 000E for
  Hiragana "HA" and 0011 for Katakana "HA".
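
  The level-by-level comparison described above can be sketched as follows.
  This is only an illustration, not the server implementation: the collation
  elements are copied from the DUCET lines quoted in this section, while the
  helper names (sort_key, first_difference_level) and the use of 0 as a level
  separator are assumptions made for the sketch.

```python
# Collation elements as (primary, secondary, tertiary), copied from the
# DUCET lines quoted in this section.
HIRAGANA_HA = [(0x3D74, 0x0020, 0x000E)]
HIRAGANA_BA = [(0x3D74, 0x0020, 0x000E), (0x0000, 0x0037, 0x0002)]
KATAKANA_HA = [(0x3D74, 0x0020, 0x0011)]


def sort_key(elements):
    """Build a three-level key: all primaries, then secondaries, then tertiaries.

    Zero weights are skipped; a 0 separator (lower than any real weight)
    is placed between levels. Hypothetical helper for illustration only.
    """
    key = []
    for level in range(3):
        if level > 0:
            key.append(0)
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


def first_difference_level(a, b):
    """Return 1, 2 or 3 for the first level where a and b differ, 0 if none."""
    for level in range(3):
        wa = [ce[level] for ce in a if ce[level] != 0]
        wb = [ce[level] for ce in b if ce[level] != 0]
        if wa != wb:
            return level + 1
    return 0


letters = {
    "HIRAGANA HA": HIRAGANA_HA,
    "HIRAGANA BA": HIRAGANA_BA,
    "KATAKANA HA": KATAKANA_HA,
}
ordered = sorted(letters, key=lambda name: sort_key(letters[name]))
print(ordered)
print(first_difference_level(HIRAGANA_HA, HIRAGANA_BA))  # secondary level
print(first_difference_level(HIRAGANA_HA, KATAKANA_HA))  # tertiary level
```

  With these three letters, the Hiragana/Katakana distinction is decided at
  level 3 and the voiced/voiceless distinction at level 2, so both forms of
  "HA" sort before "BA".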

2. Length mark and iteration mark
  The length mark ('ー', U+30FC) is a Japanese symbol that indicates a long
  vowel two morae (units of syllable timing) in length. It is usually used in
  Katakana and Hiragana writing. CLDR has a series of rules that define how
  the length mark sorts when it follows different Katakana / Hiragana
  characters. We'll use these rules to tailor its weight.

  An iteration mark is a punctuation mark that represents a duplicated
  character. For example, "人 hito" means person and "人々 hitobito" means
  people. The "々" is an iteration mark for Kanji. Japanese also has the
  iteration marks "ゝ" for Hiragana and "ヽ" for Katakana. CLDR has rules that
  define how the iteration marks sort. We'll use these rules to tailor their
  weight.

3. Kanji
  CLDR defines the sort order of 6355 Kanji characters. All of them belong to
  the set defined by JIS X 0208, a Japanese Industrial Standard.

  All these Japanese Kanji characters are in the range [U+4E00, U+9FFF]. DUCET
  assigns no collation element to any code point in this range, so in other
  collations their weight has to be calculated as an implicit weight, which
  consists of two collation elements. Because Kanji characters are common in
  Japanese text, we do not use their implicit weight. Instead we give them a
  tailored explicit weight according to the collating order defined in CLDR.
  This explicit weight has only one collation element, which is better for
  performance because it reduces the time spent looking up the weight table.
  The tailored weights are greater than the maximum weight of the Kana
  characters.
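
  As a rough illustration of the single-element weights, the sketch below
  hands out consecutive primary weights in list order. The starting value
  54A4 and the end value 6D76 come from the weight range table in section 4;
  the consecutive assignment, the function name assign_kanji_weights and the
  three-character sample list are assumptions made for the example, not the
  actual tailoring data.

```python
FIRST_KANJI_WEIGHT = 0x54A4  # start of the Japanese Han range in section 4
KANJI_COUNT = 6355           # number of Kanji whose order CLDR defines


def assign_kanji_weights(kanji_in_cldr_order, first_weight=FIRST_KANJI_WEIGHT):
    """Map each Kanji to one primary weight, following the given order.

    Hypothetical helper: assumes weights are assigned consecutively.
    """
    return {ch: first_weight + i for i, ch in enumerate(kanji_in_cldr_order)}


# Illustrative sample only; the real input would be the full CLDR list.
sample = ["亜", "唖", "娃"]
weights = assign_kanji_weights(sample)
for ch in sample:
    print(f"U+{ord(ch):04X} -> {weights[ch]:04X}")

# With 6355 consecutive weights the last one lands on the end of the range.
last_weight = FIRST_KANJI_WEIGHT + KANJI_COUNT - 1
print(f"{last_weight:04X}")
```

  Under that assumption, 6355 consecutive weights starting at 54A4 end at
  6D76, which matches the tailored range listed for Japanese Han.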

  Other, non-Japanese Han characters keep their implicit weight ([FB40 - FB85,
  0020, 0002][P2, 0000, 0000]).
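
  For reference, the UCA implicit weight derivation for Han code points can
  be sketched as below. It produces two collation elements of the form
  [AAAA, 0020, 0002][BBBB, 0000, 0000], the same form used for the maximum
  Han weight in section 4. The function name is hypothetical, and the base is
  chosen from block ranges only, which is a simplification: UCA also checks
  the Unified_Ideograph property, and this sketch does not.

```python
def han_implicit_weight(code_point):
    """Return the two implicit collation elements for a Han code point.

    Simplifying assumption: code points in [U+4E00, U+9FFF] use base FB40,
    every other code point passed in is treated as an extension-block
    ideograph with base FB80.
    """
    base = 0xFB40 if 0x4E00 <= code_point <= 0x9FFF else 0xFB80
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


for cp in (0x4E00, 0x9FFF, 0x2CEA1):
    first, second = han_implicit_weight(cp)
    print(f"U+{cp:04X}: [{first[0]:04X}, {first[1]:04X}, {first[2]:04X}]"
          f"[{second[0]:04X}, {second[1]:04X}, {second[2]:04X}]")
```

  The first element's primary weight is FB40 or FB41 for the range
  [U+4E00, U+9FFF] and reaches FB85 for the highest extension-block code
  point tried here.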

4. Reorder
  CLDR defines the reorder rule [Latin, Kana, Hani] for Japanese, which means
  Kana characters compare greater than Latin, and Han characters compare
  greater than Kana. But DUCET places other character groups between them. For
  example, Greek (and Coptic, Cyrillic, etc.) characters are between Latin and
  Kana. So we can see the original character groups as [Latin, CharA, Kana,
  CharB, Hani, others]. The Japanese collation rearranges them as [Latin,
  Kana, Hani, CharA, CharB, others].

  For the Latin and Kana groups, the reorder implementation from WL#9108
  already works. But for CharA and CharB, because non-Japanese Han characters
  use implicit weights according to UCA, moving CharA and CharB after all Han
  characters means we need to give CharA and CharB greater weights. The
  maximum implicit weight of Han characters defined by UCA is [FB85, 0020,
  0002][P2, 0000, 0000]. So if a character in CharA or CharB has weight
  [P, S, T] in DUCET, giving it the adjusted weight [FB86, 0000, 0000][P, S,
  T] is enough to achieve this (the values of P, S and T do not change).

  Please note that reordering only happens at the primary weight level. The
  primary weight ranges of all the character groups involved are:
  Char Group       | Original Weight Range       | Tailored Weight Range
  ----------------------------------------------------------------------------
  Latin            | 1C47 - 1FB5                 | 1C47 - 1FB5
  CharA            | 1FB9 - 3D59                 | [FB86, 1FB9] - [FB86, 3D59]
  Kana             | 3D5A - 3D8B                 | 1FB6 - 1FE7
  CharB            | 3D89 - 54A3                 | [FB86, 3D89] - [FB86, 54A3]
  Japanese Han     | [FB40, AAAA] - [FB41, BBBB] | 54A4 - 6D76
  non-Japanese Han | [FB40, XXXX] - [FB85, YYYY] | [FB40, XXXX] - [FB85, YYYY]
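
  The mapping in the table can be sketched as a small function over primary
  weights. The range constants are taken from the table; the function name
  tailored_primary, the tuple representation of two-element weights, and
  passing weights outside the listed ranges through unchanged are assumptions
  of this sketch rather than a description of the server code.

```python
LATIN = (0x1C47, 0x1FB5)
KANA = (0x3D5A, 0x3D8B)
KANA_TAILORED_START = 0x1FB6
CHAR_A_START = 0x1FB9   # first primary weight of CharA
CHAR_B_END = 0x54A3     # last primary weight of CharB
PREFIX = 0xFB86         # sorts after the maximum Han implicit weight FB85


def tailored_primary(primary):
    """Return the tailored primary weight(s) for a DUCET primary weight.

    The result is a tuple so that one-element and two-element weights can
    be compared directly. Kana is checked before the CharA/CharB span.
    """
    if LATIN[0] <= primary <= LATIN[1]:
        return (primary,)                    # Latin keeps its weight
    if KANA[0] <= primary <= KANA[1]:
        return (KANA_TAILORED_START + (primary - KANA[0]),)  # moved up
    if CHAR_A_START <= primary <= CHAR_B_END:
        return (PREFIX, primary)             # CharA / CharB moved after Han
    return (primary,)                        # assumption: left unchanged


samples = {
    "Latin (first)": tailored_primary(0x1C47),
    "Kana HA": tailored_primary(0x3D74),
    "CharA (first)": tailored_primary(0x1FB9),
    "CharB (last)": tailored_primary(0x54A3),
    "Japanese Han (first)": (0x54A4,),       # tailored explicit weight
    "non-Japanese Han (max AAAA)": (0xFB85, 0xFFFF),
}
ordered = sorted(samples, key=samples.get)
print(ordered)
```

  Sorting the sample weights gives Latin, Kana, Japanese Han, non-Japanese
  Han, CharA, CharB, which is the group order the reorder rule asks for.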