This section contains a list of terms and definitions used in the context of collations.
allkeys.txt: An example of a series of collating-table entries as defined by UCA (Unicode Collation Algorithm).
Actually UCA says allkeys.txt is a “collation element table”, that is, it is the part of a collating table which shows collating elements.
COLLATING ELEMENT: The unit which linguistically-aware users perceive as the minimal building block in string comparisons.
Usually there is a one-to-one relation between characters and collating elements, for example in English there is a character “A” and a collating element for “A”. More rarely there is a many-to-one relation, for example in traditional Spanish the two-character combination “LL” is a single collating element.
Usually there is a one-to-many relation between collating elements and weights (because there are multiple levels); however, for an ignorable character, one collating element has zero weights.
COLLATING TABLE: A table which describes all the rules for a collation, including Posix-like “Locale” declarations and a list of collating elements.
Here are entries for collating elements from two sources, ISO 14651 and allkeys.txt.
[From ISO 14651]
    <U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN
    <UFF04> <S2C4>;<BASE>;<WIDE>;<UFF04> % FULLWIDTH DOLLAR SIGN
    <UFE69> <S2C4>;<BASE>;<SMALL>;<UFE69> % SMALL DOLLAR SIGN
[From allkeys.txt 4.0]
    0024 ; [.0E0F.0020.0002.0024] # DOLLAR SIGN
    FF04 ; [.0E0F.0020.0003.FF04] # FULLWIDTH DOLLAR SIGN; QQK
    FE69 ; [.0E0F.0020.000F.FE69] # SMALL DOLLAR SIGN; QQK
Clearly these are the same thing, but ISO 14651 uses names (e.g. “BASE”) where allkeys.txt uses numbers (e.g. 0020). So ISO 14651 had to define earlier in its table BASE = 0020; MIN = 0002; WIDE = 0003; SMALL = 000F etc.
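The correspondence between the two notations can be sketched in a few lines of Python. This is only an illustration: the symbol table holds just the four values quoted above, and the function name iso_weights_to_numbers is a hypothetical helper, not part of either standard.

```python
# Hypothetical symbol table, holding only the values quoted in the text.
SYMBOLS = {"BASE": 0x0020, "MIN": 0x0002, "WIDE": 0x0003, "SMALL": 0x000F}

def iso_weights_to_numbers(weights, symbols):
    """Replace known symbolic weights such as <BASE> with 4-digit hex numbers.

    Fields whose symbol is not in the table (for example <S2C4>) are
    returned unchanged.
    """
    out = []
    for field in weights.split(";"):
        name = field.strip("<>")
        out.append(f"{symbols[name]:04X}" if name in symbols else field)
    return out

print(iso_weights_to_numbers("<S2C4>;<BASE>;<MIN>;<U0024>", SYMBOLS))
# ['<S2C4>', '0020', '0002', '<U0024>']
```

The level-2 and level-3 fields now match the 0020 and 0002 in the allkeys.txt entry for DOLLAR SIGN.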
COLLATION ELEMENT: Do not use. Use collating element.
COLLATION TABLE: Do not use. Use collating table.
COLLATING TABLE ENTRY: A line in a collating table, representing one fact.
Each “line” in allkeys.txt (which is a subset of a collating table) is an entry for one collating element.
CONTRACTION: A mapping from N characters to less-than-N collating elements.
Contraction is rare; usually one character has one collating element, for example the character “C” has one collating element “C”. But take an example from traditional Spanish: “LL” is a single collating element between “L” and “M”. Contraction also occurs when there has been decomposition. For example here are two collating element entries (from allkeys.txt):
    0622 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
    0627 0653 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
Notice that there is one collating element labelled 0627 0653, which clearly is the result of mapping from two characters U+0627 ARABIC LETTER ALEF and U+0653 ARABIC MADDAH ABOVE, with the same weights as the composed character U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE.
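One common way to find contractions while scanning a string is to prefer the longest character sequence that the collating table knows about. The sketch below assumes that approach; the function name collating_elements and the idea of passing the known contractions as a set are illustrative choices, not something the standards prescribe.

```python
def collating_elements(text, contractions):
    """Split text into collating elements, preferring the longest contraction.

    `contractions` is a set of multi-character sequences that the (hypothetical)
    collating table treats as single collating elements. Any character that
    does not start a contraction becomes a collating element by itself.
    """
    longest = max((len(c) for c in contractions), default=1)
    result, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in contractions:
                result.append(piece)
                i += size
                break
    return result

print(collating_elements("LLAMA", {"LL"}))  # ['LL', 'A', 'M', 'A']
print(collating_elements("LAMA", {"LL"}))   # ['L', 'A', 'M', 'A']
```

The same function handles the Arabic case: with the two-character sequence U+0627 U+0653 in the set, that pair comes back as a single collating element.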
EXPANSION: A one-to-many mapping from collating element to weighting levels.
For example, German Sharp S may be treated as “ss”, so the allkeys.txt entry for collating element 00DF (Sharp S) is:
    00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S;
The entry for “s” alone is:
    0073 ; [.11AF.0020.0002.0073] # LATIN SMALL LETTER S
Since 0000 means ignorable, the two-level weight strings are:
    11AF 11AF 0020 0199 0020 /* for SHARP S */
    11AF 11AF 0020 0020 /* for 'ss' */
IGNORABLE CHARACTER: A character which has one collating element which has no significance for comparison.
One ignorable character has one collating element, but zero weights at all levels. For example (from allkeys.txt):
    0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
This is ignorable for three levels but not four levels. Therefore it is an “ignorable character” when you produce a weight string for one, two, or three levels.
“Ignorable at level 1” means the level-1 weight is ignorable, as represented by 0000 in allkeys.txt. “Fully ignorable” means ignorable for all levels.
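The distinction between “ignorable for N levels” and “fully ignorable” can be checked mechanically. The following is a minimal sketch; is_ignorable is a hypothetical helper name, and the tuple holds the four weights of the HEBREW ACCENT ETNAHTA entry.

```python
def is_ignorable(weights, levels):
    """True when every weight in the first `levels` levels is 0000."""
    return all(w == 0 for w in weights[:levels])

# Weights copied from the allkeys.txt entry for 0591.
ETNAHTA = (0x0000, 0x0000, 0x0000, 0x0591)

print(is_ignorable(ETNAHTA, 3))  # True
print(is_ignorable(ETNAHTA, 4))  # False
```

With a three-level collation the accent contributes nothing to the weight string; with four levels it contributes its level-4 weight, so it is not fully ignorable.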
ISO 14651: The ISO/IEC 14651 “International String Ordering” standard.
LEVEL: A prioritization order for weights.
Each level has a name “level + number”, for example “level 1”, “level 2”, “level 3”, “level 4”. (Do not use, or rarely use, equivalent terms “primary”, “secondary”, “tertiary”, “quaternary”.) Typically level 1 is the character-differs level for WHERE clauses; levels 2 and following are case-differs, accent-differs, or something-minor-differs levels which might be useful for ORDER BY clauses. For example, from allkeys.txt:
    0061 ; [.0FD0.0020.0002.0061] # LATIN SMALL LETTER A
    24D0 ; [.0FD0.0020.0006.24D0] # CIRCLED LATIN SMALL LETTER A;
    0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
There are four levels here. Level 1 is always 0FD0 for “A”. Level 2 is always 0020. Level 3 is 0002 for SMALL, 0006 for CIRCLED, and 0008 for CAPITAL. Level 4 is the same as the Unicode code point value. Do not confuse “weight level” with “weighting level”.
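To make the effect of levels concrete, the sketch below compares the three “A” entries using only the first N weights of each. The dictionary keys and the helper name equal_up_to are illustrative; the weights are copied from the entries above.

```python
# Weights copied from the three allkeys.txt entries, keyed by a short label.
WEIGHTS = {
    "SMALL A": (0x0FD0, 0x0020, 0x0002, 0x0061),
    "CIRCLED SMALL A": (0x0FD0, 0x0020, 0x0006, 0x24D0),
    "CAPITAL A": (0x0FD0, 0x0020, 0x0008, 0x0041),
}

def equal_up_to(x, y, levels):
    """Compare two single collating elements using only the first `levels` levels."""
    return WEIGHTS[x][:levels] == WEIGHTS[y][:levels]

print(equal_up_to("SMALL A", "CAPITAL A", 2))  # True
print(equal_up_to("SMALL A", "CAPITAL A", 3))  # False

# Ordering by the first three levels separates the three forms.
print(sorted(WEIGHTS, key=lambda name: WEIGHTS[name][:3]))
# ['SMALL A', 'CIRCLED SMALL A', 'CAPITAL A']
```

Truncating a tuple is only equivalent to a level-by-level comparison for single collating elements; for longer strings the weights have to be regrouped by level first, as described under WEIGHT STRING.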
ORDERING KEY: Do not use. Use “weight string”.
SORTKEY: Do not use. Use “weight string”.
SUBKEY: A sequence of weights for a single level.
UCA: Unicode Collation Algorithm as described in Unicode Technical Standard #10, http://www.unicode.org/reports/tr10.
WEIGHT: A positive numeric value used for comparisons.
Weights come from collating tables and go to weight strings. Often a weight appears as a 4-digit hexadecimal number in collating tables. For example (from allkeys.txt):
    0062 ; [.0FE6.0020.0002.0062] # LATIN SMALL LETTER B
This is the entry for collating element 0062, and there are 4 weights: 0FE6, 0020, 0002, 0062.
WEIGHT STRING: A binary string, sometimes called a “sortkey” or an “ordering key”, produced by taking a series of weights from a collating table for a certain number of levels, ordering them by position and level, and outputting.
For example: starting with a character string ABC, and knowing that the number of levels is 2, look up the collating elements for its characters:
    0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
    0042 ; [.0FE6.0020.0008.0042] # LATIN CAPITAL LETTER B
    0043 ; [.0FFE.0020.0008.0043] # LATIN CAPITAL LETTER C
Taking the level-1 weights in position order and then the level-2 weights gives 0FD0 0FE6 0FFE 0020 0020 0020. The weight_string() function produces a weight string.
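The procedure in the definition (take weights for a number of levels, order them by level and then by position, and output them) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of weight_string(): the name build_weight_string is hypothetical, weights equal to 0000 are treated as ignorable and skipped, and no separator is written between levels, matching the SHARP S example in this glossary.

```python
def build_weight_string(elements, levels):
    """Concatenate weights level by level, skipping ignorable (0000) weights.

    `elements` is a list of weighting elements in string order; each one is a
    tuple of per-level weights. The result is shown as space-separated hex.
    """
    out = []
    for level in range(levels):
        for element in elements:
            weight = element[level]
            if weight != 0:
                out.append(weight)
    return " ".join(f"{w:04X}" for w in out)

# Weighting elements copied from the entries for A, B, C.
ABC = [
    (0x0FD0, 0x0020, 0x0008, 0x0041),
    (0x0FE6, 0x0020, 0x0008, 0x0042),
    (0x0FFE, 0x0020, 0x0008, 0x0043),
]
print(build_weight_string(ABC, 2))  # 0FD0 0FE6 0FFE 0020 0020 0020

# The three weighting elements of SHARP S (an expansion).
SHARP_S = [
    (0x11AF, 0x0020, 0x0004, 0x00DF),
    (0x0000, 0x0199, 0x0004, 0x00DF),
    (0x11AF, 0x0020, 0x001F, 0x00DF),
]
print(build_weight_string(SHARP_S, 2))  # 11AF 11AF 0020 0199 0020
```

Because the output groups all level-1 weights first, two weight strings can be compared left to right and level-1 differences always take priority over level-2 differences.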
WEIGHTING ELEMENT: A sequence of weights, in ascending order by level.
For example, from allkeys.txt:
    00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S
There are three weighting elements in this example, each surrounded by square brackets:
    [.11AF.0020.0004.00DF]
    [.0000.0199.0004.00DF]
    [.11AF.0020.001F.00DF]
Often one collating element has only one weighting element (which has many weights), but SHARP S is an example of expansion.
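Splitting an entry into its weighting elements is a small parsing task. The sketch below assumes the line layout seen in the examples in this glossary (code points, a semicolon, bracketed groups of dot-separated hex weights, then a # comment); parse_entry is a hypothetical helper name, and real allkeys.txt files may contain details this does not handle.

```python
import re

def parse_entry(line):
    """Split one allkeys.txt-style line into (code points, weighting elements).

    Each weighting element is returned as a list of integer weights,
    in ascending order by level.
    """
    body = line.split("#", 1)[0]              # drop the trailing comment
    chars, _, elements = body.partition(";")  # code points ; weighting elements
    code_points = chars.split()
    weighting_elements = [
        [int(w, 16) for w in group.split(".")]
        for group in re.findall(r"\[[.*]([0-9A-Fa-f.]+)\]", elements)
    ]
    return code_points, weighting_elements

line = ("00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF]"
        "[.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S")
code_points, elements = parse_entry(line)
print(code_points)    # ['00DF']
print(len(elements))  # 3
```

A result with more than one weighting element indicates an expansion; a result with more than one code point indicates a contraction.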
ZERO WEIGHTS: The meaning is “an empty sequence of weights” (the ISO 14651 definition), not “weights with value 0000” (the UCA definition).
For example (from allkeys.txt):
    0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
There are three “empty sequences of weights” here, all of which look like 0000, which we interpret as code for “empty”.