This section contains a list of terms and definitions used in the context of collations.
allkeys.txt: An example of a series of collating-table entries as defined by UCA (Unicode Collation Algorithm).
Strictly, UCA says allkeys.txt is a “collation element table”; that is, it is the part of a collating table which shows collating elements.
COLLATING ELEMENT: The unit which linguistically-aware users perceive as the minimal building block in string comparisons.
Usually there is a one-to-one relation between characters and collating elements, for example in English there is a character “A” and a collating element for “A”. More rarely there is a many-to-one relation, for example in traditional Spanish the two-character combination “LL” is a single collating element.
Usually there is a one-to-many relation between collating elements and weights (because there are multiple levels); however, for an ignorable character, one collating element has zero weights.
COLLATING TABLE: A table which describes all the rules for a collation, including POSIX-like “Locale” declarations and a list of collating elements.
Here are entries for collating elements from two sources, ISO 14651 and allkeys.txt:
[From ISO 14651]
<U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN
<UFF04> <S2C4>;<BASE>;<WIDE>;<UFF04> % FULLWIDTH DOLLAR SIGN
<UFE69> <S2C4>;<BASE>;<SMALL>;<UFE69> % SMALL DOLLAR SIGN
[From allkeys.txt 4.0]
0024 ; [.0E0F.0020.0002.0024] # DOLLAR SIGN
FF04 ; [.0E0F.0020.0003.FF04] # FULLWIDTH DOLLAR SIGN; QQK
FE69 ; [.0E0F.0020.000F.FE69] # SMALL DOLLAR SIGN; QQK
Clearly these are the same thing, but ISO 14651 uses names (e.g. “BASE”) where allkeys.txt uses numbers (e.g. 0020). So ISO 14651 had to define, earlier in its table, BASE = 0020; MIN = 0002; WIDE = 0003; SMALL = 000F; etc.
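The name-to-number correspondence can be illustrated with a minimal Python sketch that rewrites an ISO 14651-style entry in allkeys.txt style. The function name `iso_to_allkeys` and the `SYMBOLS` dictionary are hypothetical, and the value given for `S2C4` is an assumption read off the matching allkeys.txt entry rather than something ISO 14651 states.

```python
# Symbol values for BASE, MIN, WIDE and SMALL are the ones listed in the text.
# ASSUMPTION: S2C4 -> 0E0F is inferred from the matching allkeys.txt entry.
SYMBOLS = {
    "BASE": "0020",
    "MIN": "0002",
    "WIDE": "0003",
    "SMALL": "000F",
    "S2C4": "0E0F",
}

def iso_to_allkeys(entry: str) -> str:
    """Rewrite '<U0024> <S2C4>;<BASE>;<MIN>;<U0024>' with numeric weights."""
    code, weights = entry.split(None, 1)
    code_point = code.strip("<>")[1:]        # '<U0024>' -> '0024'
    numbers = []
    for token in weights.split(";"):
        name = token.strip().strip("<>")
        if name in SYMBOLS:
            numbers.append(SYMBOLS[name])    # symbolic weight name
        else:
            numbers.append(name[1:])         # '<U0024>' -> '0024'
    return f"{code_point} ; [.{'.'.join(numbers)}]"

print(iso_to_allkeys("<U0024> <S2C4>;<BASE>;<MIN>;<U0024>"))
# prints: 0024 ; [.0E0F.0020.0002.0024]
```

The sketch ignores the trailing “%” comment and handles only the entry shape shown above.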
COLLATION ELEMENT: Do not use. Use collating element.
COLLATION TABLE: Do not use. Use collating table.
COLLATING TABLE ENTRY: A line in a collating table, representing one fact.
Each “line” in allkeys.txt (which is a subset of a collating table) is an entry for one collating element.
CONTRACTION: A many-to-one mapping from characters to collating elements.
Contraction is rare; usually one character maps to one collating element, for example the character “C” has one collating element “C”. But take an example from traditional Spanish: “LL” is a single collating element between “L” and “M”. Contraction also occurs when there has been decomposition. For example, here are two collating element entries (from allkeys.txt):
0622 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
0627 0653 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
Notice that there is one collating element labelled 0627 0653, which clearly is the result of mapping from two characters, U+0627 ARABIC LETTER ALEF and U+0653 ARABIC MADDAH ABOVE, with the same weights as the composed U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE.
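One common way to honour contractions when splitting a string into collating elements is to try the longest table key first. The following Python sketch does that over a two-entry table taken from the example above; the names `TABLE` and `to_collating_elements` are hypothetical, and the table deliberately contains nothing else.

```python
# Keys are tuples of code points; values are the four weights of the entry.
TABLE = {
    (0x0622,): ("15E2", "0020", "0002", "0622"),
    (0x0627, 0x0653): ("15E2", "0020", "0002", "0622"),  # contraction
}
MAX_KEY_LEN = max(len(key) for key in TABLE)

def to_collating_elements(text):
    """Split text into table keys, preferring the longest match at each position."""
    code_points = [ord(ch) for ch in text]
    result = []
    i = 0
    while i < len(code_points):
        for n in range(min(MAX_KEY_LEN, len(code_points) - i), 0, -1):
            key = tuple(code_points[i:i + n])
            if key in TABLE:
                result.append(key)
                i += n
                break
        else:
            # This sketch has no fallback for characters missing from TABLE.
            raise KeyError(f"no entry for U+{code_points[i]:04X}")
    return result

composed = to_collating_elements("\u0622")
decomposed = to_collating_elements("\u0627\u0653")
print(len(decomposed))                                   # prints: 1
print(TABLE[composed[0]] == TABLE[decomposed[0]])        # prints: True
```

The two-character sequence yields a single collating element, and its weights equal those of the composed character.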
EXPANSION: A one-to-many mapping from collating element to weighting elements.
For example, German Sharp S may be treated as “ss”, so the allkeys.txt entry for 00DF (Sharp S) is:
00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S;
The entry for “s” alone is:
0073 ; [.11AF.0020.0002.0073] # LATIN SMALL LETTER S
Since 0000 means ignorable, the two-level weight strings are:
11AF 11AF 0020 0199 0020 /* for SHARP S */
11AF 11AF 0020 0020 /* for 'ss' */
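The two weight strings above can be reproduced with a short Python sketch. It assumes, as the example suggests, that weights are emitted level by level and that 0000 weights are dropped; the names `SHARP_S`, `SMALL_S` and `weight_string` are hypothetical, and only the first two levels of each weighting element are kept.

```python
# Level-1 and level-2 weights of each weighting element, from the entries above.
SHARP_S = [("11AF", "0020"), ("0000", "0199"), ("11AF", "0020")]
SMALL_S = [("11AF", "0020")]

def weight_string(elements, levels=2):
    """Emit all level-1 weights, then all level-2 weights, skipping 0000."""
    out = []
    for level in range(levels):
        for element in elements:
            weight = element[level]
            if weight != "0000":      # 0000 means ignorable at this level
                out.append(weight)
    return out

print(" ".join(weight_string(SHARP_S)))            # prints: 11AF 11AF 0020 0199 0020
print(" ".join(weight_string(SMALL_S + SMALL_S)))  # prints: 11AF 11AF 0020 0020
```

With one level the two results are identical, so Sharp S and “ss” differ only from level 2 onward.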
IGNORABLE CHARACTER: A character which has one collating element which has no significance for comparison.
One ignorable character has one collating element, but zero weights at all levels. For example (from allkeys.txt):
0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
This is ignorable for three levels but not four levels. Therefore it is an “ignorable character” when you produce a weight string for one, two, or three levels. “Ignorable at level 1” means the level-1 weight is ignorable, as represented by 0000 in allkeys.txt. “Fully ignorable” means ignorable for all levels.
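The distinction between “ignorable at level N” and “ignorable for N levels” can be expressed as two small checks. This Python sketch uses the ETNAHTA entry; the function names are hypothetical.

```python
# The four weights of 0591 HEBREW ACCENT ETNAHTA, from the entry above.
ETNAHTA = ("0000", "0000", "0000", "0591")

def ignorable_at_level(weights, level):
    """True if the weight at the given 1-based level is 0000."""
    return weights[level - 1] == "0000"

def ignorable_for_levels(weights, levels):
    """True if every weight in the first `levels` levels is 0000."""
    return all(w == "0000" for w in weights[:levels])

print(ignorable_for_levels(ETNAHTA, 3))  # prints: True
print(ignorable_for_levels(ETNAHTA, 4))  # prints: False
```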
ISO 14651: The ISO/IEC 14651 “International String Ordering” standard.
LEVEL: A prioritization order for weights.
Each level has a name “level + number”, for example “level 1”, “level 2”, “level 3”, “level 4”. (Do not use, or rarely use, the equivalent terms “primary”, “secondary”, “tertiary”, “quaternary”.) Typically level 1 is the character-differs level for WHERE clauses; levels 2 and following are case-differs, accent-differs, or something-minor-differs levels which might be used for ORDER BY clauses. For example,
0061 ; [.0FD0.0020.0002.0061] # LATIN SMALL LETTER A
24D0 ; [.0FD0.0020.0006.24D0] # CIRCLED LATIN SMALL LETTER A;
0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
There are four levels here. Level 1 is always 0FD0 for “A”. Level 2 is 0020. Level 3 is 0002, 0006, or 0008. Level 4 is the same as the Unicode code point value. Do not confuse “weight level” with “weighting element”.
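To see which level separates the three variants of “A”, the following Python sketch finds the first level at which two entries differ. The `ENTRIES` dictionary copies the weights from the three entries above; the function name is hypothetical, and the sketch only compares single characters.

```python
# Weights for levels 1 to 4, copied from the three entries above.
ENTRIES = {
    "a": (0x0FD0, 0x0020, 0x0002, 0x0061),
    "\u24d0": (0x0FD0, 0x0020, 0x0006, 0x24D0),  # CIRCLED LATIN SMALL LETTER A
    "A": (0x0FD0, 0x0020, 0x0008, 0x0041),
}

def first_differing_level(x, y):
    """Return the 1-based level at which x and y first differ, or None."""
    pairs = zip(ENTRIES[x], ENTRIES[y])
    for level, (wx, wy) in enumerate(pairs, start=1):
        if wx != wy:
            return level
    return None

print(first_differing_level("a", "A"))       # prints: 3
print(first_differing_level("a", "\u24d0"))  # prints: 3
print(first_differing_level("a", "a"))       # prints: None
```

A comparison that stops after level 1 or level 2 treats all three characters as equal; the difference appears only at level 3.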
ORDERING KEY: Do not use. Use “weight string”.
SORTKEY: Do not use. Use “weight string”.
SUBKEY: A sequence of weights for a single level.
UCA: Unicode Collation Algorithm as described in Unicode Technical Standard #10, http://www.unicode.org/reports/tr10.
WEIGHT: A positive numeric value used for comparisons.
Weights come from collating tables and go to weight strings. Often a weight appears as a 4-digit hexadecimal number in collating tables. For example (from allkeys.txt):
0062 ; [.0FE6.0020.0002.0062] # LATIN SMALL LETTER B
This is the entry for collating element 0062, and there are 4 weights: 0FE6, 0020, 0002, 0062.
WEIGHT STRING: A binary string, sometimes called a “sortkey” or an “ordering key”, produced by taking a series of weights from a collating table for a certain number of levels, ordering them by position and level, and outputting.
For example: starting with a character string ABC, and knowing that the number of levels is 2, look up the collating elements for A, B, and C in allkeys.txt:
0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
0042 ; [.0FE6.0020.0008.0042] # LATIN CAPITAL LETTER B
0043 ; [.0FFE.0020.0008.0043] # LATIN CAPITAL LETTER C
From these entries a weight-string function produces a weight string: the level-1 weights 0FD0 0FE6 0FFE followed by the level-2 weights 0020 0020 0020.
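Because a weight string is a binary string, the ABC example can be carried through to bytes. The Python sketch below packs each weight as a 16-bit big-endian integer with no separator between levels; that encoding is an assumption made for illustration, and `ENTRIES` and `weight_string` are hypothetical names.

```python
import struct

# Weights for levels 1 to 4, copied from the three entries above.
ENTRIES = {
    "A": (0x0FD0, 0x0020, 0x0008, 0x0041),
    "B": (0x0FE6, 0x0020, 0x0008, 0x0042),
    "C": (0x0FFE, 0x0020, 0x0008, 0x0043),
}

def weight_string(text, levels):
    """Collect weights level by level, then pack them into bytes."""
    weights = []
    for level in range(levels):
        for ch in text:
            weight = ENTRIES[ch][level]
            if weight != 0:                   # skip ignorable (0000) weights
                weights.append(weight)
    # ASSUMPTION: 16-bit big-endian per weight, no level separator.
    return struct.pack(f">{len(weights)}H", *weights)

print(weight_string("ABC", 2).hex())  # prints: 0fd00fe60ffe002000200020
```

Comparing two such byte strings with an ordinary bytewise comparison then compares level 1 first and level 2 only on a level-1 tie, provided both strings have the same number of level-1 weights.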
WEIGHTING ELEMENT: A sequence of weights, in ascending order by level.
For example, from allkeys.txt:
00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S
There are three weighting elements in this example, each surrounded by square brackets:
[.11AF.0020.0004.00DF]
[.0000.0199.0004.00DF]
[.11AF.0020.001F.00DF]
Often one collating element has only one weighting element (which has many weights), but SHARP S is an example of expansion.
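Splitting an entry into its weighting elements is a simple text operation. The following Python sketch parses lines of the shape shown in this section, where each weighting element starts with “[.”; the names `ELEMENT_RE` and `parse_entry` are hypothetical, and other element prefixes that a real table may contain are not handled.

```python
import re

# Matches one weighting element such as [.11AF.0020.0004.00DF].
ELEMENT_RE = re.compile(r"\[\.([0-9A-Fa-f.]+)\]")

def parse_entry(line):
    """Return (code points, list of weighting elements) for one entry line."""
    body = line.split("#", 1)[0]              # drop the trailing comment
    chars, elements = body.split(";", 1)
    code_points = chars.split()
    weighting_elements = [m.split(".") for m in ELEMENT_RE.findall(elements)]
    return code_points, weighting_elements

line = ("00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF]"
        "[.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S")
code_points, elements = parse_entry(line)
print(code_points)    # prints: ['00DF']
print(len(elements))  # prints: 3
print(elements[1])    # prints: ['0000', '0199', '0004', '00DF']
```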
ZERO WEIGHTS: The meaning is “an empty sequence of weights” (the ISO 14651 definition), not “weights with value 0000” (the UCA definition).
For example (from allkeys.txt):
0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
There are three “empty sequences of weights” here, all of which look like 0000, which we interpret as code for “empty”.