WL#5624: Collation customization improvements

Affects: Server-5.6   —   Status: Complete

We have a few feature requests for more language collations.
Some of the feature requests even include a modified Index.xml
file which is supposed to add a collation for a certain
language according to the manual article "Adding a UCA
collation to a Unicode Character Set". But it does not always work,
because we do not support a few collation customization 
features widely used in world languages.


This task will not add any particular language collations.
The goal of this task is to extend the MySQL collation
customization system so more people can simply download
a collation definition from Unicode's Common Locale
Data Repository and paste its relevant part into the Index.xml
file (namely, the part between the <rules> and </rules> tags).

This WL uses Unicode's TR35, whose version at the time of writing was 1.8.1:
http://unicode.org/reports/tr35/tr35-16.html

The most important missing features are:

1. Long contractions - reset to more than 2 characters

  LDML  Example: <reset>abc</reset><p>z</p>
  Basic Example: &abc < z

  This puts character 'z' primary greater than a sequence of
  three characters a,b,c.

  We'll support contractions up to six characters long.

2. Long expansions - shift for a sequence of more than 2 characters

  LDML  Example: <reset>a</reset><p>xyz</p>
  Basic Example: &a < xyz

  This puts a sequence of three characters x,y,z primary greater
  than character 'a'.

  We'll support expansions up to six characters long.

3. A combination of N1 and N2:

  LDML  Example: <reset>abc</reset><p>xyz</p>
  Basic Example: &abc < xyz

  This puts a sequence of three characters 'xyz' primary greater
  than a sequence of three characters 'abc'.

4. Reset before

  LDML:                            Basic:
  <reset before="primary">         &[before primary]
  <reset before="1">               &[before 1]
  <reset before="secondary">       &[before secondary]
  <reset before="2">               &[before 2]
  <reset before="tertiary">        &[before tertiary]
  <reset before="3">               &[before 3]
  <reset before="quaternary">      &[before quaternary]
  <reset before="4">               &[before 4]

  LDML  Example: <reset before="primary">a</reset><p>b</p>
  Basic Example: &[before 1]a < b

  This puts letter 'b' immediately before 'a' on the primary level.

  As we support only primary level collations at the moment,
  a reset-before rule for the secondary or any higher level will
  effectively behave like an identity rule, i.e. it will make the
  shift character equal to the reset character.

  For the primary level we'll calculate the weight as follows.
  Suppose we have this rule:

  &[before primary]B < C

  i.e. we need to put C before B, but after A, so the resulting
  order is: A < C < B.

  Let the primary weight of B be [BBBB]. We cannot just use [BBBB-1]
  as the weight for C: DUCET does not have enough unused weights
  between any two characters, so using [BBBB-1] will likely make C
  equal to the previous character, which is A, and we'll get this
  order instead of the desired one: A = C < B.

  To guarantee that C is sorted after A, we'll use an expansion
  with a kind of "large enough character". As the "large enough
  character" we'll use "last_non_ignorable", which is the character
  with the largest weight in DUCET (excluding CJK characters).

  For Unicode version 4.0.0 last_non_ignorable will be U+A48C:
  A48C  ; [.233D.0020.0002.A48C] # YI SYLLABLE YYR

  For Unicode version 5.2.0 last_non_ignorable will be U+1342E:
  1342E ; [.3ACA.0020.0002.1342E] # EGYPTIAN HIEROGLYPH AA032

  We'll compose the weight for C as:

  [BBBB-1][MMMM+1]

  where [MMMM] is the weight of "last_non_ignorable".

5. Long tailoring.

  Currently the tailoring size is limited to 1Kb, which is often
  not enough. The limitation comes from this static size array
  in the file strings/ctype.h:

  #define MY_CS_TAILORING_SIZE 1024
  typedef struct my_cs_file_info
  {
    ...
    char tailoring[MY_CS_TAILORING_SIZE];
    ...
  } MY_CHARSET_LOADER;

  The biggest file in the current CLDR version is about 420Kb (zh.xml).
  The code should be changed to use dynamically allocated buffers
  instead of this static size array.

6. Letters rather than escape sequences.

  Currently, for non-ASCII characters, we only support escape
  sequences in collation customization:

  LDML  Example: <reset>c</reset><p>\u010B</p><t>\u010A</t>
  Basic Example: &c < \u010B <<< \u010A

  We will allow using characters as well:

  LDML  Example: <reset>c</reset><p>ċ</p><t>Ċ</t>
  Basic Example: &c < ċ <<< Ċ

  (In the examples above \u010B and \u010A are escape sequences for
  the small and capital letter C WITH DOT ABOVE, which we can replace
  with the actual characters ċ and Ċ, as shown.)

  Note: we'll require that data inserted into the Index.xml file
  uses only utf-8 encoding.

7. Quaternary difference

  LDML : <reset>a</reset><q>b</q>
  Basic: &a <<<< b

  Make 'b' sort after 'a' on the quaternary level.

8. Abbreviated shift syntax:

  LDML Example:
  <pc>xyz</pc> equal to <p>x</p><p>y</p><p>z</p>
  <sc>xyz</sc> equal to <s>x</s><s>y</s><s>z</s>
  <tc>xyz</tc> equal to <t>x</t><t>y</t><t>z</t>
  <qc>xyz</qc> equal to <q>x</q><q>y</q><q>z</q>
  <ic>xyz</ic> equal to <i>x</i><i>y</i><i>z</i>

  Note, the basic format does not have an analogue for abbreviated
  shift syntax, so it will be translated to a set of regular
  shift rules:

  LDML:  <reset>a</reset><pc>xyz</pc>
  Basic: & a < x < y < z

  # Note, in version 1.9.1 of TR35, Unicode introduced
  # basic abbreviated syntax using the ASTERISK sign:
  # LDML:  <reset>a</reset><pc>xyz</pc>
  # Basic: & a <* xyz
  # This feature will be done in a separate WL.

9. Normal expansion syntax

  LDML  Example: <reset>c</reset><x><s>k</s><extend>h</extend></x>
  Basic Example: &c << k / h

  Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if
  it expands to a character after 'c' followed by an 'h'.

  We'll allow long sequences in normal expansion syntax too:

  LDML  Example: <reset>cs</reset><x><t>ccs</t><extend>cs</extend></x>
  Basic Example: &cs <<< ccs / cs

  Make 'ccs' sort after the sequence 'cscs' (on the tertiary level).

  We'll support resulting sequences up to six characters long.

10. Previous context

  Unicode's tr35 says:

  >The context before a character can affect how it is ordered, such as in
  >Japanese. This could be expressed with a combination of contractions and
  >expansions, but is faster using a context. (The actual weights produced are
  >different, but the resulting string comparisons are the same.) If a context
  >element occurs, it must be the first item in the rule, and requires an
  >element.

  >For example, suppose that "-" is sorted like the previous vowel. Then one
  >could have rules that take "a-", "e-", and so on. However, that means that
  >every time a very common character (a, e, ...) is encountered, a system will
  >slow down as it looks for possible contractions. An alternative is to indicate
  >that when "-" is encountered, and it comes after an 'a', it sorts like an 'a',
  >and so on.

  LDML:  <reset>a</reset><x><context>b</context><t>-</t></x>
  Basic: & a <<< b | -

  Makes '-' sort tertiary greater than 'a', but only when '-'
  comes after 'b'.

  In the current CLDR version, "context before" appears only in ja.xml.

  This section is optional. The developer will decide whether to
  implement it depending on coding complexity and time frame.
  In case it is not implemented, we can postpone this feature
  until "WL#2555 Standard Japanese Collation" time.

11. Previous context with expansion

  LDML: <reset>a</reset><x><context>abc</context><p>def</p><extend>ghi</extend></x>
  Basic "Sequence" expansion: &aghi < abc | def
  Basic "Normal" expansion  : &a < abc | def / ghi

  Makes 'def' sort primary greater than 'aghi', but only when 'def'
  comes after 'abc'.

  Note, as the Basic analogue is not shown explicitly in tr35, it was
  checked against IBM's ICU library (the most famous open source
  library supporting UCA). ICU does understand both kinds of basic
  syntax: sequence expansion with previous context and normal
  expansion with previous context.

  Note, there are no real examples of "previous context with expansion"
  in the current version of CLDR.

  We will not do this feature under the terms of WL#5624. We'll
  postpone it until real collations using this feature appear either
  in CLDR or in any other authoritative source.

12. Logical reset positions

  LDML:                           Basic:
  <first_non_ignorable/>          [first non-ignorable]
  <last_non_ignorable/>           [last non-ignorable]
  <first_primary_ignorable/>      [first primary ignorable]
  <last_primary_ignorable/>       [last primary ignorable]
  <first_secondary_ignorable/>    [first secondary ignorable]
  <last_secondary_ignorable/>     [last secondary ignorable]
  <first_tertiary_ignorable/>     [first tertiary ignorable]
  <last_tertiary_ignorable/>      [last tertiary ignorable]
  <first_trailing/>               [first trailing]
  <last_trailing/>                [last trailing]
  <first_variable/>               [first variable]
  <last_variable/>                [last variable]

  LDML Example: <reset><last_non_ignorable/></reset><p>z</p>
  Text Example: &[last non-ignorable] < z

  Make letter 'z' sort after all primary non-ignorable characters
  which have a DUCET entry and which are not CJK.

  We'll use the following code points as logical positions:

  For Unicode-4.0.0 collations:
  U+02D0  - first_non_ignorable
  U+A48C  - last_non_ignorable
  U+0332  - first_primary_ignorable
  U+20EA  - last_primary_ignorable
  U+0000  - first_secondary_ignorable
  U+FE73  - last_secondary_ignorable
  U+0000  - first_tertiary_ignorable
  U+FE73  - last_tertiary_ignorable
  U+0000  - first_trailing
  U+0000  - last_trailing
  U+0009  - first_variable
  U+2183  - last_variable

  For Unicode-5.2.0 collations:
  U+02D0  - first_non_ignorable
  U+1342E - last_non_ignorable
  U+0332  - first_primary_ignorable
  U+101FD - last_primary_ignorable
  U+0000  - first_secondary_ignorable
  U+FE73  - last_secondary_ignorable
  U+0000  - first_tertiary_ignorable
  U+FE73  - last_tertiary_ignorable
  U+0000  - first_trailing
  U+0000  - last_trailing
  U+0009  - first_variable
  U+1D371 - last_variable

13. More verbosity when reading character set definition file (Index.xml)

  Currently the Index.xml parser prints error diagnostics to the
  server log only when fatal XML syntax errors happen, for example:

  mysqld: Error while parsing '/usr/local/mysql-5.6/share/charsets/Index.xml': at line 673 pos 15: '' unexpected ('' wanted)

  When the Index.xml parser meets an unknown tag or attribute, it
  currently ignores the unknown part silently, which makes mistakes
  in Index.xml hard to find.

  We will change the behaviour to refuse loading LDML definitions
  that have any unknown parts. An attempt to use a collation with
  any unknown XML pieces will result in an error on the client side,
  as well as in a warning in the server log, for example:

  [Warning] Unknown tag or attribute in '/usr/local/mysql-5.6/share/charsets/Index.xml': at line 673 pos 15: 'some-unknow-tag-or-attribute'

14. Special purpose commands and collation settings

  LDML:
  <suppress_contractions>...</suppress_contractions>
  <optimize>...</optimize>

  Basic:
  [suppress contractions ]
  [optimize ]  (e.g. [optimize [A-Z]])
  [strength 1|2|3|4|I|primary|secondary|tertiary|identical]
  [alternate non-ignorable|shifted]
  [backwards on|off|2]
  [normalization on|off]
  [caseLevel on|off]
  [caseFirst upper|lower|off]
  [hiraganaQ on|off]
  [variableTop ]
  [numeric on|off]
  [match-boundaries none|whole-character|whole-word]
  [match-style minimal|medial|maximal]

  We will need to fully support these commands and settings after
  we have "WL#896 Primary, Secondary and Tertiary Sorts" done.

  For now we will throw an error on the client side and write
  a warning to the server log whenever this kind of syntax is met,
  as well as whenever any other kind of syntax we don't understand
  is met.

  An error should happen only on an attempt to use a collation with
  unknown definition pieces. An unknown tag/attribute in a certain
  collation LDML definition should prevent the server neither from
  loading the rest of the Index.xml file, nor from further use of
  the other (well-defined) collations in the same Index.xml.

  So errors and warnings will be printed at different moments:

  - errors on the client side will be produced only on a real
    attempt to use a collation with unknown syntax parts.
  - warnings about unknown tags/attributes in LDML syntax will be
    printed to the server log on server start up, during loading of
    Index.xml (so the server administrator can see that something
    is wrong and some collations may not work)

Testing:

  Tests should cover all new features using definition examples in
  mysql-test/std_data/Index.xml with corresponding SQL tests in
  mysql-test/t/ctype_ldml.test.

  Additionally, the implementer will add tests into ctype_ldml.test
  using CLDR definitions for the Myanmar (my.xml), Bengali (bn.xml)
  and Maltese (mt.xml) languages pasted into
  mysql-test/std_data/Index.xml.

  Any other additional tests covering the new features will be
  welcome as well.
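
  The split between start-up warnings and use-time errors described
  under items 13 and 14 could look roughly like the sketch below. It
  is only an illustration: the type collation_def and the functions
  note_unknown_syntax() and use_collation() are hypothetical names
  invented here, not existing server code, and the client error text
  is an assumption.

```c
#include <stddef.h>
#include <stdio.h>

#define LOAD_ERROR_SIZE 128 /* hypothetical limit for the stored message */

/* Hypothetical per-collation record kept by the Index.xml loader. */
typedef struct
{
  const char *name;                  /* collation name                  */
  char load_error[LOAD_ERROR_SIZE];  /* empty string = well-defined     */
} collation_def;

/* Called at start-up when the parser meets an unknown tag/attribute:
   remember the first problem for this collation, print a warning to
   the log, and let loading of the rest of the file continue. */
static void note_unknown_syntax(collation_def *cs, const char *file,
                                int line, int pos, const char *what,
                                FILE *log)
{
  if (cs->load_error[0] == '\0')
    snprintf(cs->load_error, sizeof(cs->load_error),
             "unknown tag or attribute '%s'", what);
  fprintf(log,
          "[Warning] Unknown tag or attribute in '%s': "
          "at line %d pos %d: '%s'\n", file, line, pos, what);
}

/* Called when a client actually tries to use the collation.
   Returns 0 if the collation is usable, 1 (with a message in errbuf)
   if its definition contained unknown pieces. */
static int use_collation(const collation_def *cs,
                         char *errbuf, size_t errlen)
{
  if (cs->load_error[0] == '\0')
    return 0;
  snprintf(errbuf, errlen, "Collation '%s' cannot be used: %s",
           cs->name, cs->load_error);
  return 1;
}
```

  Keeping the problem on the individual collation record is what
  lets the other collations in the same Index.xml stay usable.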
Documenting:

  Bernt suggested that it is worth mentioning in the manual that
  after modifying a user defined collation in the Index.xml file
  one should rebuild indexes for all columns that use this collation.

References:

  - BUG#22008 Myanmar Collation Extension for UCS and UTF8
  - BUG#37898 Please add Bengali (Bangladesh) [bn_BD]language collation
  - BUG#32540 UTF8 collation for the Maltese alphabet is incomplete
  - WL#5619 Maltese collation
  - MySQL Forum thread:
    http://forums.mysql.com/read.php?103,183888,183888
  - Unicode Technical Standard #35:
    Unicode Locale Data Markup Language (LDML):
    http://unicode.org/reports/tr35/
  - ICU User Guide: Collation customization
    http://userguide.icu-project.org/collation/customization
Change the code which handles LDML.
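
As one illustration of such a change, the weight composition
[BBBB-1][MMMM+1] proposed for "reset before" on the primary level can
be sketched as below. This is a minimal sketch, not the actual server
code: reset_before_primary() and compare_weights() are hypothetical
helpers, and weights are assumed to be plain 16-bit primary weights
with a non-zero reset weight. The two constants are the primary
weights from the DUCET lines quoted for last_non_ignorable.

```c
#include <stddef.h>
#include <stdint.h>

/* Primary weights of last_non_ignorable, from the DUCET lines
   quoted in the description of "reset before". */
#define LAST_NON_IGNORABLE_400 0x233D /* U+A48C,  Unicode 4.0.0 */
#define LAST_NON_IGNORABLE_520 0x3ACA /* U+1342E, Unicode 5.2.0 */

/* Hypothetical helper: build the weight string [BBBB-1][MMMM+1] for
   a character shifted "before primary" relative to a reset character
   whose primary weight is reset_weight (assumed to be > 0).
   Returns the number of weights written to out. */
static size_t reset_before_primary(uint16_t reset_weight,
                                   uint16_t last_non_ignorable,
                                   uint16_t out[2])
{
  out[0] = (uint16_t) (reset_weight - 1);
  out[1] = (uint16_t) (last_non_ignorable + 1);
  return 2;
}

/* Hypothetical helper: compare two weight strings position by
   position; if one is a prefix of the other, the shorter one
   sorts first. Returns <0, 0 or >0. */
static int compare_weights(const uint16_t *a, size_t alen,
                           const uint16_t *b, size_t blen)
{
  size_t i, n = alen < blen ? alen : blen;
  for (i = 0; i < n; i++)
  {
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  }
  return (alen > blen) - (alen < blen);
}
```

With made-up adjacent weights 0x0E33 for A and 0x0E34 for B, C gets
the weight string [0E33][233E] under Unicode 4.0.0. A alone, and even
A followed by the heaviest non-CJK character, still compares less
than C, while C compares less than B, which gives the desired order
A < C < B.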