5.5.2 Denominations With Regards to Collations

This section contains a list of terms and definitions used in the context of collations.

  • allkeys.txt: An example of a series of collating-table entries as defined by UCA (Unicode Collation Algorithm).

    Strictly, UCA calls allkeys.txt a collation element table; that is, it is the part of a collating table that lists collating elements.

  • COLLATING ELEMENT: The unit which linguistically-aware users perceive as the minimal building block in string comparisons.

    Usually there is a one-to-one relation between characters and collating elements. For example, in English there is a character A and a collating element for A. More rarely there is a many-to-one relation. For example, in traditional Spanish the two-character combination LL is a single collating element.

    Usually there is a one-to-many relation between collating elements and weights (because there are multiple levels); however, for an ignorable character, one collating element has zero weights.

  • COLLATING TABLE: A table which describes all the rules for a collation, including POSIX-like locale declarations and a list of collating elements.

    Here are entries for collating elements from two sources, ISO 14651 and allkeys.txt:

    [From ISO 14651]
    <U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN

    [From allkeys.txt 4.0]
    0024  ; [.0E0F.0020.0002.0024] # DOLLAR SIGN
    FF04  ; [.0E0F.0020.0003.FF04] # FULLWIDTH DOLLAR SIGN; QQK
    FE69  ; [.0E0F.0020.000F.FE69] # SMALL DOLLAR SIGN; QQK

    Clearly these are the same thing, but ISO 14651 uses names (e.g. BASE) where allkeys.txt uses numbers (e.g. 0020). So ISO 14651 has to define, earlier in its table, BASE = 0020; MIN = 0002; WIDE = 0003; SMALL = 000F; and so on.
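
    The correspondence between the two notations can be sketched in Python. This is only an illustration: the symbol table holds the four values named above, and the mapping S2C4 = 0E0F is an assumption inferred from the matching allkeys.txt entry, not a value taken from ISO 14651 itself. The function name iso_to_allkeys is hypothetical.

```python
import re

# BASE, MIN, WIDE, SMALL come from the text above.
# S2C4 -> 0E0F is an assumption inferred from the matching allkeys.txt entry.
SYMBOLS = {
    "BASE": "0020",
    "MIN": "0002",
    "WIDE": "0003",
    "SMALL": "000F",
    "S2C4": "0E0F",
}

def iso_to_allkeys(line):
    """Rewrite one ISO 14651-style entry in allkeys.txt-style notation."""
    body, _, comment = line.partition("%")
    names = re.findall(r"<([^>]+)>", body)
    char, weights = names[0], names[1:]
    resolved = []
    for name in weights:
        if name in SYMBOLS:
            # Symbolic weight name: replace it with its number.
            resolved.append(SYMBOLS[name])
        elif re.fullmatch(r"U[0-9A-F]{4,6}", name):
            # <Uxxxx> stands for the code point itself.
            resolved.append(name[1:])
        else:
            raise KeyError(name)
    return f"{char[1:]}  ; [.{'.'.join(resolved)}] # {comment.strip()}"

entry = iso_to_allkeys("<U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN")
print(entry)
```

    Under these assumptions the ISO 14651 line for DOLLAR SIGN is rewritten into the same form as the allkeys.txt line for 0024.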

  • COLLATION ELEMENT: Do not use. Use collating element.

  • COLLATION TABLE: Do not use. Use collating table.

  • COLLATING TABLE ENTRY: A line in a collating table, representing one fact.

    Each line in allkeys.txt (which is a subset of a collating table) is an entry for one collating element.

  • CONTRACTION: A mapping from N characters to fewer than N collating elements.

    Contraction is rare. Normally one character maps to one collating element, for example the character C has one collating element C. But take an example from traditional Spanish: LL is a single collating element that sorts between L and M. Contraction also occurs when there has been decomposition. For example, here are two collating element entries (from allkeys.txt 5.0.0):

    0622  ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
    0627 0653 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE

    Notice that there is one collating element labelled 0627 0653, which is clearly the result of mapping from the two characters U+0627 ARABIC LETTER ALEF and U+0653 ARABIC MADDAH ABOVE. It has the same weights as the composed character U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE.
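
    One common way to find contractions is to try the longest matching character sequence first. The following Python sketch assumes that approach; it is not a description of how any particular implementation works. The table holds only the two entries quoted above, and the name collating_elements is hypothetical.

```python
# Entries copied from the two allkeys.txt 5.0.0 lines quoted above.
TABLE = {
    "\u0622": [("15E2", "0020", "0002", "0622")],
    "\u0627\u0653": [("15E2", "0020", "0002", "0622")],
}

def collating_elements(text, table=TABLE):
    """Split text into collating elements, preferring the longest match."""
    max_len = max(len(key) for key in table)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that a contraction wins
        # over a shorter entry starting at the same position.
        for length in range(min(max_len, len(text) - i), 0, -1):
            key = text[i:i + length]
            if key in table:
                result.append(key)
                i += length
                break
        else:
            raise KeyError(f"no entry for U+{ord(text[i]):04X}")
    return result

composed = collating_elements("\u0622")
decomposed = collating_elements("\u0627\u0653")
print(len(decomposed), TABLE[composed[0]] == TABLE[decomposed[0]])
```

    The two-character input yields a single collating element whose weights equal those of the composed character.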

  • EXPANSION: A one-to-many mapping from a collating element to weighting elements.

    For example, German Sharp S may be treated as ss, so the allkeys.txt entry for collating element 00DF (Sharp S) is:

    00DF  ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S;

    The entry for s alone is:

    0073  ; [.11AF.0020.0002.0073] # LATIN SMALL LETTER S

    Since 0000 means ignorable, two-level weight strings are:

    11AF 11AF 0020 0199 0020      /* for SHARP S */
    11AF 11AF 0020 0020           /* for 'ss' */
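
    The two weight strings above can be reproduced with a short Python sketch. It assumes the simplest possible scheme: weights are concatenated level by level, 0000 weights are dropped, and no separator is written between levels. The function name weight_string is hypothetical and is unrelated to MySQL's own function.

```python
def weight_string(weighting_elements, levels):
    """Concatenate weights level by level, skipping ignorable (0000) weights."""
    out = []
    for level in range(levels):
        for element in weighting_elements:
            weight = element[level]
            if weight != "0000":
                out.append(weight)
    return " ".join(out)

# Weighting elements copied from the allkeys.txt entries quoted above.
SHARP_S = [
    ("11AF", "0020", "0004", "00DF"),
    ("0000", "0199", "0004", "00DF"),
    ("11AF", "0020", "001F", "00DF"),
]
SMALL_S = [("11AF", "0020", "0002", "0073")]

sharp_s_key = weight_string(SHARP_S, levels=2)
ss_key = weight_string(SMALL_S + SMALL_S, levels=2)
print(sharp_s_key)  # 11AF 11AF 0020 0199 0020
print(ss_key)       # 11AF 11AF 0020 0020
```

    At level 1 the two results are identical, so SHARP S and ss compare equal when only one level is used; the extra 0199 weight distinguishes them at level 2.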

  • IGNORABLE CHARACTER: A character which has one collating element which has no significance for comparison.

    One ignorable character has one collating element, but zero weights at all levels. For example (from allkeys.txt):

    0591  ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA

    This is ignorable for three levels but not four levels. Therefore it is an ignorable character when you produce a weight string for one, two, or three levels. Ignorable at level 1 means the level-1 weight is ignorable, as represented by 0000 in allkeys.txt. Fully ignorable means ignorable for all levels.

  • ISO 14651: The ISO/IEC 14651 International String Ordering standard.

    Draft documents:

  • LEVEL: A prioritization order for weights.

    Each level is named level + number, for example level 1, level 2, level 3, level 4. (Avoid, or use only rarely, the equivalent terms primary, secondary, tertiary, quaternary.) Typically level 1 is the character-differs level, useful for WHERE clauses; levels 2 and following are the accent-differs, case-differs, or something-minor-differs levels, which might be useful for ORDER BY clauses. For example, from allkeys.txt 5.0.0:

    0061  ; [.0FD0.0020.0002.0061] # LATIN SMALL LETTER A
    24D0  ; [.0FD0.0020.0006.24D0] # CIRCLED LATIN SMALL LETTER A;
    0041  ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A

    There are four levels here. Level 1 is always 0FD0 for A. Level 2 is always 0020. Level 3 is 0002 for SMALL, 0006 for CIRCLED, 0008 for CAPITAL. Level 4 is the same as the Unicode code point value. Do not confuse weight level with weighting level.
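
    The effect of choosing a number of levels can be illustrated in Python. This sketch compares single characters only, using the three entries above; the names ENTRIES, key, and equal_at are hypothetical.

```python
# Weights copied from the three allkeys.txt 5.0.0 entries quoted above.
ENTRIES = {
    "a": ("0FD0", "0020", "0002", "0061"),       # LATIN SMALL LETTER A
    "\u24D0": ("0FD0", "0020", "0006", "24D0"),  # CIRCLED LATIN SMALL LETTER A
    "A": ("0FD0", "0020", "0008", "0041"),       # LATIN CAPITAL LETTER A
}

def key(char, levels):
    """Return the weights of a single character up to the given level."""
    return ENTRIES[char][:levels]

def equal_at(levels, first, second):
    """True if the two characters have the same weights through `levels`."""
    return key(first, levels) == key(second, levels)

same_at_1 = equal_at(1, "a", "A")   # only 0FD0 is compared
same_at_2 = equal_at(2, "a", "A")   # 0FD0 and 0020 are compared
same_at_3 = equal_at(3, "a", "A")   # 0002 versus 0008 now differ
ordered = sorted(ENTRIES, key=lambda c: key(c, 3))
print(same_at_1, same_at_2, same_at_3, ordered)
```

    With one or two levels the three forms of A are indistinguishable; with three levels they sort as small, circled, capital.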

  • ORDERING KEY: Do not use. Use weight string.

  • SORTKEY: Do not use. Use weight string.

  • SUBKEY: A sequence of weights for a single level.

  • UCA: Unicode Collation Algorithm as described in Unicode Technical Standard #10, http://www.unicode.org/reports/tr10.

  • WEIGHT: A positive numeric value used for comparisons.

    Weights come from collating tables and go to weight strings. Often weight appears as a 4-digit number in collating tables. For example (from allkeys.txt):

    0062  ; [.0FE6.0020.0002.0062] # LATIN SMALL LETTER B

    This is the entry for collating element 0062, and there are 4 weights: 0FE6 and 0020 and 0002 and 0062.

  • WEIGHT STRING: A binary string, sometimes called a sortkey or an ordering key, produced by taking a series of weights from a collating table for a certain number of levels, ordering them by position and level, and outputting.

    For example: starting with a character string ABC, and knowing that the number of levels is 2, look up the collating elements for A and B and C in allkeys.txt 5.0.0:

    0041  ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
    0042  ; [.0FE6.0020.0008.0042] # LATIN CAPITAL LETTER B
    0043  ; [.0FFE.0020.0008.0043] # LATIN CAPITAL LETTER C

    Result: 0FD0.0FE6.0FFE.0020.0020.0020, that is, the three level-1 weights followed by the three level-2 weights. MySQL's WEIGHT_STRING() function produces a weight string.

  • WEIGHTING ELEMENT: A sequence of weights, in ascending order by level.

    For example, from allkeys.txt 5.0.0:

    00DF  ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S

    There are three weighting elements in this example, each surrounded by square brackets:

    [.11AF.0020.0004.00DF]
    [.0000.0199.0004.00DF]
    [.11AF.0020.001F.00DF]

    Often one collating element has only one weighting element (which has many weights), but SHARP S is an example of expansion.
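
    Splitting an entry into its weighting elements is mostly a matter of finding the bracketed groups. The Python sketch below assumes the line layout shown in the examples in this section (code points, a semicolon, bracketed groups, then a # comment); parse_entry is a hypothetical helper, not part of any library.

```python
import re

def parse_entry(line):
    """Split an allkeys.txt-style line into code points and weighting elements."""
    data, _, comment = line.partition("#")
    code_points, _, elements = data.partition(";")
    weighting_elements = [
        # Drop the leading marker character, then split the weights on dots.
        tuple(group.strip(".*").split("."))
        for group in re.findall(r"\[([^\]]+)\]", elements)
    ]
    return code_points.split(), weighting_elements, comment.strip()

line = ("00DF  ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF]"
        "[.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S")
code_points, weighting_elements, name = parse_entry(line)
print(code_points, len(weighting_elements), name)
```

    For the SHARP S line this gives one collating element (00DF) with three weighting elements of four weights each.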

  • ZERO WEIGHTS: The meaning is an empty sequence of weights (the ISO 14651 definition), not weights with value 0000 (the UCA definition).

    For example (from allkeys.txt):

    0591  ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA

    There are three empty sequences of weights here, all of which look like 0000, which we interpret as code for empty.