Documentation Home
MySQL Globalization
Related Documentation Download this Excerpt
PDF (US Ltr) - 496.3Kb
PDF (A4) - 494.9Kb


1.14.4.2 LDML Syntax Supported in MySQL

This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification available at http://www.unicode.org/reports/tr35/, which should be consulted for further information. MySQL recognizes a large enough subset of the syntax that, in many cases, it is possible to download a collation definition from the Unicode Common Locale Data Repository and paste the relevant part (that is, the part between the <rules> and </rules> tags) into the MySQL Index.xml file. The rules described here are all supported except that character sorting occurs only at the primary level. Rules that specify differences at secondary or higher sort levels are recognized (and thus can be included in collation definitions) but are treated as equality at the primary level.

The MySQL server generates diagnostics when it finds problems while parsing the Index.xml file. See Section 1.14.4.3, “Diagnostics During Index.xml Parsing”.

Character Representation

Characters named in LDML rules can be written literally or in \unnnn format, where nnnn is the hexadecimal Unicode code point value. For example, A and á can be written literally or as \u0041 and \u00E1. Within hexadecimal values, the digits A through F are not case-sensitive; \u00E1 and \u00e1 are equivalent. For UCA 4.0.0 collations, hexadecimal notation can be used only for characters in the Basic Multilingual Plane, not for characters outside the BMP range of 0000 to FFFF. For UCA 5.2.0 collations, hexadecimal notation can be used for any character.

The Index.xml file itself should be written using UTF-8 encoding.

Syntax Rules

LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.

  • A <reset> rule does not specify any ordering in and of itself. Instead, it resets the ordering for subsequent shift rules to cause them to be taken in relation to a given character. Either of the following rules resets subsequent shift rules to be taken in relation to the letter 'A':

    <reset>A</reset>
    <reset>\u0041</reset>
  • The <p>, <s>, and <t> shift rules define primary, secondary, and tertiary differences of a character from another character:

    • Use primary differences to distinguish separate letters.

    • Use secondary differences to distinguish accent variations.

    • Use tertiary differences to distinguish lettercase variations.

    Either of these rules specifies a primary shift rule for the 'G' character:

    <p>G</p>
    <p>\u0047</p>
  • The <i> shift rule indicates that one character sorts identically to another. The following rules cause 'b' to sort the same as 'a':

    <reset>a</reset>
    <i>b</i>
  • Abbreviated shift syntax specifies multiple shift rules using a single pair of tags. The following table shows the correspondence between abbreviated syntax rules and the equivalent nonabbreviated rules.

    Table 1.5 Abbreviated Shift Syntax

    Abbreviated Syntax Nonabbreviated Syntax
    <pc>xyz</pc> <p>x</p><p>y</p><p>z</p>
    <sc>xyz</sc> <s>x</s><s>y</s><s>z</s>
    <tc>xyz</tc> <t>x</t><t>y</t><t>z</t>
    <ic>xyz</ic> <i>x</i><i>y</i><i>z</i>

  • An expansion is a reset rule that establishes an anchor point for a multiple-character sequence. MySQL supports expansions 2 to 6 characters long. The following rules put 'z' greater at the primary level than the sequence of three characters 'abc':

    <reset>abc</reset>
    <p>z</p>
  • A contraction is a shift rule that sorts a multiple-character sequence. MySQL supports contractions 2 to 6 characters long. The following rules put the sequence of three characters 'xyz' greater at the primary level than 'a':

    <reset>a</reset>
    <p>xyz</p>
  • Long expansions and long contractions can be used together. These rules put the sequence of three characters 'xyz' greater at the primary level than the sequence of three characters 'abc':

    <reset>abc</reset>
    <p>xyz</p>
  • Normal expansion syntax uses <x> plus <extend> elements to specify an expansion. The following rules put the character 'k' greater at the secondary level than the sequence 'ch'. That is, 'k' behaves as if it expands to a character after 'c' followed by 'h':

    <reset>c</reset>
    <x><s>k</s><extend>h</extend></x>

    This syntax permits long sequences. These rules sort the sequence 'ccs' greater at the tertiary level than the sequence 'cscs':

    <reset>cs</reset>
    <x><t>ccs</t><extend>cs</extend></x>

    The LDML specification describes normal expansion syntax as tricky. See that specification for details.

  • Previous context syntax uses <x> plus <context> elements to specify that the context before a character affects how it sorts. The following rules put '-' greater at the secondary level than 'a', but only when '-' occurs after 'b':

    <reset>a</reset>
    <x><context>b</context><s>-</s></x>
  • Previous context syntax can include the <extend> element. These rules put 'def' greater at the primary level than 'aghi', but only when 'def' comes after 'abc':

    <reset>a</reset>
    <x><context>abc</context><p>def</p><extend>ghi</extend></x>
  • Reset rules permit a before attribute. Normally, shift rules after a reset rule indicate characters that sort after the reset character. Shift rules after a reset rule that has the before attribute indicate characters that sort before the reset character. The following rules put the character 'b' immediately before 'a' at the primary level:

    <reset before="primary">a</reset>
    <p>b</p>

    Permissible before attribute values specify the sort level by name or the equivalent numeric value:

    <reset before="primary">
    <reset before="1">
    <reset before="secondary">
    <reset before="2">
    <reset before="tertiary">
    <reset before="3">
  • A reset rule can name a logical reset position rather than a literal character:

    <first_tertiary_ignorable/>
    <last_tertiary_ignorable/>
    <first_secondary_ignorable/>
    <last_secondary_ignorable/>
    <first_primary_ignorable/>
    <last_primary_ignorable/>
    <first_variable/>
    <last_variable/>
    <first_non_ignorable/>
    <last_non_ignorable/>
    <first_trailing/>
    <last_trailing/>

    These rules put 'z' greater at the primary level than nonignorable characters that have a Default Unicode Collation Element Table (DUCET) entry and that are not CJK:

    <reset><last_non_ignorable/></reset>
    <p>z</p>

    Logical positions have the code points shown in the following table.

    Table 1.6 Logical Reset Position Code Points

    Logical Position Unicode 4.0.0 Code Point Unicode 5.2.0 Code Point
    <first_non_ignorable/> U+02D0 U+02D0
    <last_non_ignorable/> U+A48C U+1342E
    <first_primary_ignorable/> U+0332 U+0332
    <last_primary_ignorable/> U+20EA U+101FD
    <first_secondary_ignorable/> U+0000 U+0000
    <last_secondary_ignorable/> U+FE73 U+FE73
    <first_tertiary_ignorable/> U+0000 U+0000
    <last_tertiary_ignorable/> U+FE73 U+FE73
    <first_trailing/> U+0000 U+0000
    <last_trailing/> U+0000 U+0000
    <first_variable/> U+0009 U+0009
    <last_variable/> U+2183 U+1D371

  • The <collation> element permits a shift-after-method attribute that affects character weight calculation for shift rules. The attribute has these permitted values:

    • simple: Calculate character weights as for reset rules that do not have a before attribute. This is the default if the attribute is not given.

    • expand: Use expansions for shifts after reset rules.

    Suppose that '0' and '1' have weights of 0E29 and 0E2A and we want to put all basic Latin letters between '0' and '1':

    <reset>0</reset>
    <pc>abcdefghijklmnopqrstuvwxyz</pc>

    For simple shift mode, weights are calculated as follows:

    'a' has weight 0E29+1
    'b' has weight 0E29+2
    'c' has weight 0E29+3
    ...

    However, there are not enough vacant positions to put 26 characters between '0' and '1'. The result is that digits and letters are intermixed.

    To solve this, use shift-after-method="expand". Then weights are calculated like this:

    'a' has weight [0E29][233D+1]
    'b' has weight [0E29][233D+2]
    'c' has weight [0E29][233D+3]
    ...

    233D is the UCA 4.0.0 weight for character 0xA48C, which is the last nonignorable character (a sort of the greatest character in the collation, excluding CJK). UCA 5.2.0 is similar but uses 3ACA, for character 0x1342E.

MySQL-Specific LDML Extensions

An extension to LDML rules permits the <collation> element to include an optional version attribute in <collation> tags to indicate the UCA version on which the collation is based. If the version attribute is omitted, its default value is 4.0.0. For example, this specification indicates a collation that is based on UCA 5.2.0:

<collation id="nnn" name="utf8mb4_xxx_ci" version="5.2.0">
...
</collation>