WL#5624: Collation customization improvements

Affects: Server-5.6   —   Status: Complete   —   Priority: Medium

We have a few feature requests for more language collations.
Some of the feature requests even include a modified Index.xml
file which is supposed to add a collation for a certain
language according to the manual article "Adding a UCA
collation to a Unicode Character Set". But it does not always work,
because we do not support a few collation customization 
features widely used in world languages.

This task will not add any particular language collations.
The goal of this task is to extend the MySQL collation
customization system so more people can simply download
a collation definition from Unicode's Common Locale
Data Repository and paste its relevant part into the Index.xml
file (namely, the part between the <rules> and </rules> tags).

This WL uses Unicode's TR35, whose version at the time of writing was 1.8.1.

The most important missing features are:

1. Long contractions - reset to more than 2 characters

  LDML  Example: <reset>abc</reset><p>z</p>
  Basic Example: &abc < z

This puts character 'z' primary greater than
a sequence of three characters a,b,c.
We'll support contractions up to six characters long.

2. Long expansions - shift for a sequence of more than 2 characters

  LDML  Example: <reset>a</reset><p>xyz</p>
  Basic Example: &a < xyz

This puts a sequence of three characters
x,y,z primary greater than character 'a'.
We'll support expansions up to six characters long.

3. A combination of items 1 and 2:

  LDML  Example: <reset>abc</reset><p>xyz</p>
  Basic Example: &abc < xyz

This puts a sequence of three characters 'xyz' primary greater
than a sequence of three characters 'abc'.
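
Items 1 to 3 can be sketched as a tiny parser for the Basic syntax. This is only an illustration, not the server code: `parse_basic_rule`, `LEVELS` and `MAX_SEQUENCE_LENGTH` are hypothetical names, and the sketch assumes a rule made of exactly one reset and one shift, with the six-character limit applied to both sides.

```python
import re

# Limit taken from the text above: sequences of up to six characters.
MAX_SEQUENCE_LENGTH = 6

# Basic-syntax operators and the level each one tailors.
LEVELS = {"<": "primary", "<<": "secondary", "<<<": "tertiary",
          "<<<<": "quaternary", "=": "identical"}

# One reset sequence, one operator, one shift sequence.
_RULE = re.compile(r"\s*&\s*([^\s<=&]+)\s*(<{1,4}|=)\s*([^\s<=&]+)\s*")


def parse_basic_rule(rule):
    """Split a rule such as '&abc < xyz' into (reset, level, shift)."""
    match = _RULE.fullmatch(rule)
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    reset, operator, shift = match.groups()
    for sequence in (reset, shift):
        # len() counts code points, so non-ASCII letters count as one each.
        if len(sequence) > MAX_SEQUENCE_LENGTH:
            raise ValueError(f"sequence too long: {sequence!r}")
    return reset, LEVELS[operator], shift


print(parse_basic_rule("&abc < z"))    # item 1: long reset
print(parse_basic_rule("&a < xyz"))    # item 2: long shift
print(parse_basic_rule("&abc < xyz"))  # item 3: both
```

A rule with a seven-character reset or shift raises ValueError in this sketch; how the server should report that case is not specified here.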

4. Reset before

  <reset before="primary">
  <reset before="1">

  <reset before="secondary">
  <reset before="2">

  <reset before="tertiary">
  <reset before="3">

  <reset before="quaternary">
  <reset before="4">

  &[before primary]
  &[before 1]

  &[before secondary]
  &[before 2]

  &[before tertiary]
  &[before 3]

  &[before quaternary]
  &[before 4]

  LDML  Example: <reset before="primary">a</reset><p>b</p>
  Basic Example: &[before 1]a < b

  puts letter 'b' immediately before 'a' on primary level.

As we support only primary level collations at the moment,
<reset before="?">a</reset><?>b</?> for secondary and all higher
levels will effectively do the same as <reset>a</reset><i>b</i>,
i.e. will make the shift character equal to the reset character.

For primary level we'll calculate weight as follows.

Suppose we have this rule:  &[before primary]B < C
i.e. we need to put C before B, but after A, so
the result order is: A < C < B.
Let primary weight of B be [BBBB].
We cannot just use [BBBB-1] as weight for C:
DUCET does not have enough unused weights between any two characters,
so using [BBBB-1] will likely make C equal to the previous character,
which is A, so we'll get the order A = C < B instead of the desired one.
To guarantee that C is sorted after A, we'll use an expansion
with a kind of "large enough character".
As "large enough character" we'll use "last_non_ignorable",
which is the character with the largest weight in DUCET
(excluding CJK characters).

For Unicode version 4.0.0 last_non_ignorable will be U+A48C:
A48C  ; [.233D.0020.0002.A48C] # YI SYLLABLE YYR

For Unicode version 5.2.0 last_non_ignorable will be U+1342E:
1342E ; [.3ACA.0020.0002.1342E] # EGYPTIAN HIEROGLYPH AA032

We'll compose weight for C as: [BBBB-1][MMMM+1]
where [MMMM] is weight for "last_non_ignorable".
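
The weight composition above can be checked with a small sketch. It assumes that primary weights compare element by element like Python tuples, which is a simplification of real UCA comparison. The function name `weight_before` and the weights chosen for A and B are hypothetical; only 0x233D (the Unicode 4.0.0 weight of last_non_ignorable) comes from the text.

```python
# Primary weight of last_non_ignorable in Unicode 4.0.0 (U+A48C), from the text above.
LAST_NON_IGNORABLE_WEIGHT = 0x233D


def weight_before(base_weight, last_weight=LAST_NON_IGNORABLE_WEIGHT):
    """Return the two-element primary weight [BBBB-1][MMMM+1]."""
    if base_weight < 1:
        raise ValueError("no weight exists below the base weight")
    return (base_weight - 1, last_weight + 1)


# Hypothetical adjacent primary weights for A and B; real DUCET values differ.
weight_a = (0x1000,)
weight_b = (0x1001,)
weight_c = weight_before(weight_b[0])

# Tuples compare element by element, like sequences of primary weights.
assert weight_a < weight_c < weight_b
# 'A' followed by the heaviest non-CJK character still sorts before C.
assert weight_a + (LAST_NON_IGNORABLE_WEIGHT,) < weight_c
```

The second assertion is the reason for using [MMMM+1] rather than an arbitrary second weight: no string that starts with A and continues with a non-CJK DUCET character can reach C.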

5. Long tailoring.
Currently the tailoring size is limited to 1 KB,
which is often not enough. The limit comes from
this static size array in the file strings/ctype.h:

#define MY_CS_TAILORING_SIZE    1024
typedef struct my_cs_file_info
  char   tailoring[MY_CS_TAILORING_SIZE];

The biggest file in the current CLDR version is about 420 KB (zh.xml).
The code should be changed to use dynamically allocated buffers
instead of this static size array.

6. Letters rather than escape sequences.
Currently, for non-ASCII characters, we only support
escape sequences in collation customization:

  LDML  Example: <reset>c</reset><p>\u010A</p><t>\u010B</t> 
  Basic Example: &c < \u010A <<< \u010B

We will allow using characters as well:

  LDML  Example: <reset>c</reset><p>Ċ</p><t>ċ</t>
  Basic Example: &c < Ċ <<< ċ

(In the examples above \u010A and \u010B are escape
sequences for capital and small C WITH DOT ABOVE, which we can
replace with the actual characters Ċ and ċ, as shown.)

Note: we'll require that data inserted into the Index.xml
file uses UTF-8 encoding only.
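
The equivalence between the two notations can be illustrated with a short sketch that turns \uXXXX escapes into the characters they denote. `unescape_rule` is a hypothetical helper, and the sketch assumes four-hex-digit escapes only.

```python
import re

# Matches a four-hex-digit escape such as \u010A.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_rule(rule):
    """Replace every \\uXXXX escape with the character it denotes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), rule)


escaped = r"&c < \u010A <<< \u010B"
literal = unescape_rule(escaped)
print(literal)
# The literal form is what would be stored, UTF-8 encoded, in Index.xml.
print(literal.encode("utf-8"))
```

A parser that accepts both forms could apply such a step first and then work on characters only.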

7. Quaternary difference

  LDML  : <reset>a</reset><q>b</q>
  Basic : &a <<<< b
  Make 'b' sort after 'a' on quaternary level.

8. Abbreviated shift syntax:

  LDML Example:
  <pc>xyz</pc>  equal to  <p>x</p><p>y</p><p>z</p>
  <sc>xyz</sc>  equal to  <s>x</s><s>y</s><s>z</s>
  <tc>xyz</tc>  equal to  <t>x</t><t>y</t><t>z</t>
  <qc>xyz</qc>  equal to  <q>x</q><q>y</q><q>z</q>
  <ic>xyz</ic>  equal to  <i>x</i><i>y</i><i>z</i>

  Note, basic format does not have an analogue for abbreviated shift syntax,
  so it will be translated to a set of regular shift rules:

  LDML:  <reset>a</reset><pc>xyz</pc>
  Basic: & a < x < y < z

  # Note, in version 1.9.1 of TR35, Unicode introduced
  # basic abbreviated syntax using the ASTERISK sign:
  # LDML:  <reset>a</reset><pc>xyz</pc>
  # Basic: & a <* xyz
  # This feature will be done in a separate WL.
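
The translation of abbreviated shifts into regular shift rules can be sketched as follows. The names `expand_ldml`, `to_basic` and `ABBREVIATED` are hypothetical, and the sketch assumes the abbreviated element holds plain characters (no escape sequences or XML entities), one shift per code point.

```python
import re

# Abbreviated LDML tag -> (regular LDML tag, Basic operator).
ABBREVIATED = {"pc": ("p", "<"), "sc": ("s", "<<"), "tc": ("t", "<<<"),
               "qc": ("q", "<<<<"), "ic": ("i", "=")}

_ABBREVIATED_TAG = re.compile(r"<(pc|sc|tc|qc|ic)>(.*?)</\1>")


def expand_ldml(ldml):
    """Rewrite <pc>xyz</pc> and friends as one regular shift per character."""
    def expand(match):
        tag = ABBREVIATED[match.group(1)][0]
        return "".join(f"<{tag}>{ch}</{tag}>" for ch in match.group(2))
    return _ABBREVIATED_TAG.sub(expand, ldml)


def to_basic(reset, abbreviated_tag, characters):
    """Build the equivalent Basic rule, e.g. '& a < x < y < z'."""
    operator = ABBREVIATED[abbreviated_tag][1]
    shifts = "".join(f" {operator} {ch}" for ch in characters)
    return f"& {reset}{shifts}"


print(expand_ldml("<reset>a</reset><pc>xyz</pc>"))
print(to_basic("a", "pc", "xyz"))
```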

9. Normal expansion syntax

  LDML  Example: <reset>c</reset><x><s>k</s><extend>h</extend></x>
  Basic Example: &c << k / h

  Make 'k' sort after the sequence 'ch';  thus 'k' will behave as if
  it expands to a character after 'c' followed by an 'h'.

  We'll allow long sequences in normal expansion syntax too:
  LDML  Example: <reset>cs</reset><x><t>ccs</t><extend>cs</extend></x>
  Basic Example: &cs <<< ccs / cs

  Make 'ccs' sort after the sequence 'cscs' (on tertiary level).

  We'll support resulting sequences up to six characters long.

10. Previous context

Unicode's tr35 says:

>The context before a character can affect how it is ordered, such as in 
>Japanese. This could be expressed with a combination of contractions and 
>expansions, but is faster using a context. (The actual weights produced are 
>different, but the resulting string comparisons are the same.) If a context 
>element occurs, it must be the first item in the rule, and requires an <x> element.
>For example, suppose that "-" is sorted like the previous vowel. Then one 
>could have rules that take "a-", "e-", and so on. However, that means that 
>every time a very common character (a, e, ...) is encountered, a system will 
>slow down as it looks for possible contractions. An alternative is to indicate 
>that when "-" is encountered, and it comes after an 'a', it sorts like an 'a', 
>and so on. 

  LDML:  <reset>a</reset><x><context>b</context><t>-</t></x>
  Basic: & a <<< b | -

  Makes '-' sort tertiary greater than 'a', but only when '-' goes after 'b'.

  In the current CLDR version, "context before" appears only in ja.xml.
  This feature is optional. The developer will decide whether to implement
  it depending on coding complexity and time frame.
  In case it is not implemented, we can postpone this feature
  until "WL#2555 Standard Japanese Collation" time.

11. Previous context with expansion

  LDML: <reset>a</reset><x><context>abc</context><p>def</p><extend>ghi</extend></x>

  Basic "Sequence" expansion:  &aghi < abc | def
  Basic "Normal" expansion  :     &a < abc | def / ghi

  Makes 'def' sort primary greater than 'aghi', but only
  when 'def' comes after 'abc'.

Note, as Basic analogue is not shown explicitly in tr35,
it was checked against IBM's ICU library (the most famous
open source library supporting UCA).
ICU does understand both kinds of basic syntax:
sequence expansion with previous context and
normal expansion with previous context.

Note, there are no real examples of "previous context with expansion"
in the current version of CLDR. We will not do this feature
under the terms of WL#5624. We'll postpone it until real collations using this
feature appear either in CLDR or in any other authoritative source.

12. Logical reset positions


  [first non-ignorable]
  [last non-ignorable]
  [first primary ignorable]
  [last primary ignorable]
  [first secondary ignorable]
  [last secondary ignorable]
  [first tertiary ignorable]
  [last tertiary ignorable]
  [first trailing]
  [last trailing]
  [first variable]
  [last variable]

  LDML Example: <reset><last_non_ignorable/></reset><p>z</p>
  Basic Example: &[last non-ignorable] < z
  Make letter 'z' sort after all primary non-ignorable characters
  which have a DUCET entry and which are not CJK.

We'll use the following code points as logical positions:

For Unicode-4.0.0 collations:

  U+02D0 -   first_non_ignorable
  U+A48C -   last_non_ignorable
  U+0332 -   first_primary_ignorable
  U+20EA -   last_primary_ignorable
  U+0000 -   first_secondary_ignorable
  U+FE73 -   last_secondary_ignorable
  U+0000 -   first_tertiary_ignorable
  U+FE73 -   last_tertiary_ignorable
  U+0000 -   first_trailing
  U+0000 -   last_trailing
  U+0009 -   first_variable
  U+2183 -   last_variable

For Unicode-5.2.0 collations:
  U+02D0  -  first_non_ignorable
  U+1342E -  last_non_ignorable
  U+0332  -  first_primary_ignorable
  U+101FD -  last_primary_ignorable
  U+0000  -  first_secondary_ignorable
  U+FE73  -  last_secondary_ignorable
  U+0000  -  first_tertiary_ignorable
  U+FE73  -  last_tertiary_ignorable
  U+0000  -  first_trailing
  U+0000  -  last_trailing
  U+0009  -  first_variable
  U+1D371 -  last_variable
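
The two tables can be kept as a single lookup structure. The sketch below is an assumption about one possible representation, not the server's data layout; `logical_position` and the `POSITIONS` dictionaries are hypothetical names, and the code points are copied from the tables above.

```python
# Code points from the Unicode-4.0.0 table above.
POSITIONS_400 = {
    "first_non_ignorable": 0x02D0,
    "last_non_ignorable": 0xA48C,
    "first_primary_ignorable": 0x0332,
    "last_primary_ignorable": 0x20EA,
    "first_secondary_ignorable": 0x0000,
    "last_secondary_ignorable": 0xFE73,
    "first_tertiary_ignorable": 0x0000,
    "last_tertiary_ignorable": 0xFE73,
    "first_trailing": 0x0000,
    "last_trailing": 0x0000,
    "first_variable": 0x0009,
    "last_variable": 0x2183,
}

# The Unicode-5.2.0 table differs from 4.0.0 in three entries only.
POSITIONS_520 = {
    **POSITIONS_400,
    "last_non_ignorable": 0x1342E,
    "last_primary_ignorable": 0x101FD,
    "last_variable": 0x1D371,
}

POSITIONS = {"4.0.0": POSITIONS_400, "5.2.0": POSITIONS_520}


def logical_position(name, unicode_version):
    """Accept the Basic form '[last non-ignorable]' or 'last_non_ignorable'."""
    key = name.strip("[]").replace("-", "_").replace(" ", "_")
    return POSITIONS[unicode_version][key]


print(hex(logical_position("[last non-ignorable]", "4.0.0")))
print(hex(logical_position("[last non-ignorable]", "5.2.0")))
```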

13. More verbosity when reading character set definition file (Index.xml)

Currently the Index.xml parser prints error diagnostics
to the server log only when fatal XML syntax errors happen, for example:

mysqld: Error while parsing '/usr/local/mysql-5.6/share/charsets/Index.xml': at
line 673 pos 15: '</collation2>' unexpected ('</collation>' wanted)

When the Index.xml parser meets an unknown tag or attribute,
it currently just silently ignores the unknown parts, which makes searching
for mistakes in Index.xml hard.

We will change the behaviour to refuse loading LDML definitions
having any unknown parts. An attempt to use a collation with
any unknown XML pieces will result in an error on the client side,
as well as in a warning in the server log,
for example:

[Warning] Unknown tag or attribute in
'/usr/local/mysql-5.6/share/charsets/Index.xml': at line 673 pos 15:

14. Special purpose commands and collation settings

  <settings strength="...">
  <settings alternate="...">
  <settings backwards="...">
  <settings normalization="...">
  <settings caseLevel="...">
  <settings caseFirst="...">
  <settings hiraganaQuaternary="...">
  <settings numeric="...">
  <settings variableTop="...">
  <settings match-boundaries="...">
  <settings match-style="...">

  [suppress contractions <character ranges>]    
  [optimize <character ranges>]                  (e.g. [optimize [A-Z]])
  [strength 1|2|3|4|I|primary|secondary|tertiary|identical]
  [alternate non-ignorable|shifted]
  [backwards on|off|2]
  [normalization on|off]
  [caseLevel on|off]
  [caseFirst upper|lower|off]
  [hiraganaQ on|off]
  [variableTop <encoded Unicode string>]
  [numeric on|off]
  [match-boundaries none|whole-character|whole-word]
  [match-style minimal|medial|maximal]

We will need to fully support these commands and settings
after we have "WL#896 Primary, Secondary and Tertiary Sorts" done.
For now, we will produce an error on the client side and a warning
in the server log whenever this kind of syntax is met,
as well as whenever any other kind of syntax we don't understand is met.

An error should happen only on an attempt to use a collation
with unknown definition pieces. An unknown tag/attribute
in a certain collation LDML definition should prevent the server
neither from loading the rest of the Index.xml file, nor
from further use of the other (well-defined) collations
in the same Index.xml.

So errors and warnings will be printed at different moments:

- errors on the client side will be produced only on a real attempt
  to use a collation with unknown syntax parts.

- warnings about unknown tags/attributes in LDML syntax
  will be printed to server log on server start up,
  during loading of Index.xml (so server administrator 
  can see that something is wrong and some collations
  may not work)
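
The split between start-up warnings and use-time errors can be modelled with a toy sketch. Everything here is hypothetical: the class `CollationRegistry`, the set `KNOWN_TAGS`, the collation names and the message wording are illustrations only, not the server's actual code or log format.

```python
# Hypothetical subset of LDML tags that the loader understands.
KNOWN_TAGS = {"rules", "reset", "p", "s", "t", "q", "i"}


class CollationRegistry:
    """Toy model: warn at load time, fail only when a collation is used."""

    def __init__(self):
        self.unknown = {}      # collation name -> list of unknown tags
        self.server_log = []   # stands in for the server error log

    def load(self, name, tags):
        """Called at start-up for each collation found in Index.xml."""
        bad = [tag for tag in tags if tag not in KNOWN_TAGS]
        self.unknown[name] = bad
        for tag in bad:
            self.server_log.append(
                f"[Warning] Unknown tag '{tag}' in collation '{name}'")

    def use(self, name):
        """Called when a client actually references the collation."""
        if self.unknown[name]:
            raise ValueError(f"collation '{name}' has unknown definition parts")
        return name


registry = CollationRegistry()
registry.load("good_ci", ["rules", "reset", "p"])
registry.load("bad_ci", ["rules", "settings", "reset", "p"])  # <settings> unsupported

print(registry.server_log)       # one warning, written during loading
print(registry.use("good_ci"))   # the well-defined collation still works
```

Loading never fails in this model, so one broken definition does not affect the other collations; only `use("bad_ci")` raises.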

Tests should cover all new features using definition
examples in mysql-test/std_data/Index.xml with corresponding
SQL tests in mysql-test/t/ctype_ldml.test.

Additionally, the implementer will add tests into ctype_ldml.test
using CLDR definitions for Myanmar (my.xml), Bengali (bn.xml)
and Maltese (mt.xml) languages pasted in mysql-test/std_data/Index.xml.
Any other additional tests covering the new features will be welcome as well.

Bernt suggested that it is worth mentioning in the manual
that after modifying a user-defined collation in the Index.xml
file one should rebuild indexes for all columns that use
this collation.

- BUG#22008 Myanmar Collation Extension for UCS and UTF8
- BUG#37898 Please add Bengali (Bangladesh) [bn_BD]language collation
- BUG#32540 UTF8 collation for the Maltese alphabet is incomplete
- WL#5619 Maltese collation
- MySQL Forum thread:
- Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML):
- ICU User Guide: Collation customization
Change the code which handles LDML.