This section contains a list of terms and definitions used in the context of collations.
allkeys.txt: An example of a series of collating-table entries as defined by UCA (Unicode Collation Algorithm). Strictly speaking, UCA calls allkeys.txt a “collation element table”; that is, it is the part of a collating table which lists collating elements.
COLLATING ELEMENT: The unit which linguistically-aware users perceive as the minimal building block in string comparisons.
Usually there is a one-to-one relation between characters and collating elements, for example in English there is a character “A” and a collating element for “A”. More rarely there is a many-to-one relation, for example in traditional Spanish the two-character combination “LL” is a single collating element.
Usually there is a one-to-many relation between collating elements and weights (because there are multiple levels); however, for an ignorable character, one collating element has zero weights.
COLLATING TABLE: A table which describes all the rules for a collation, including POSIX-like “Locale” declarations and a list of collating elements.
Here are entries for collating elements from two sources, ISO 14651 and allkeys.txt:
[From ISO 14651]
<U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN
<UFF04> <S2C4>;<BASE>;<WIDE>;<UFF04> % FULLWIDTH DOLLAR SIGN
<UFE69> <S2C4>;<BASE>;<SMALL>;<UFE69> % SMALL DOLLAR SIGN
COLLATION ELEMENT: Do not use. Use collating element.
COLLATION TABLE: Do not use. Use collating table.
COLLATING TABLE ENTRY: A line in a collating table, representing one fact.
Each “line” in allkeys.txt (which is a subset of a collating table) is an entry for one collating element.
CONTRACTION: A mapping from N characters to fewer than N collating elements.
Contraction is rare. Usually one character maps to one collating element, for example the character “C” has one collating element “C”. But take an example from traditional Spanish: “LL” is a single collating element which sorts between “L” and “M”. Contraction also occurs when there has been decomposition. For example, here are two collating element entries (from allkeys.txt 5.0.0):
0622 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
0627 0653 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
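The Spanish “LL” case can be illustrated with a longest-match lookup. The following is only a minimal sketch: the table contents and the function name collating_elements are hypothetical, chosen for illustration, and are not taken from any real collating table.

```python
# Hypothetical table mapping character sequences to collating-element names.
# The entries are illustrative only; "LL" stands in for a contraction.
TABLE = {"A": "A", "L": "L", "LL": "LL", "M": "M", "O": "O"}


def collating_elements(text, table=TABLE):
    """Split text into collating elements, preferring the longest match."""
    max_len = max(len(key) for key in table)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that "LL" wins over "L".
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                result.append(table[chunk])
                i += size
                break
        else:
            raise KeyError(f"no collating element for {text[i]!r}")
    return result


print(collating_elements("LLAMA"))  # ['LL', 'A', 'M', 'A']
print(collating_elements("LOMA"))   # ['L', 'O', 'M', 'A']
```

Five characters in “LLAMA” yield four collating elements, which is the N to fewer-than-N mapping described in the definition.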
EXPANSION: A one-to-many mapping from a collating element to weighting elements.
For example, German Sharp S may be treated as “ss”, so the allkeys.txt entry for collating element 00DF (Sharp S) is:
00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S
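To make the expansion visible, the Sharp S entry can be split into its bracketed groups. This is a rough sketch that assumes the line layout seen in the allkeys.txt excerpts on this page (code points, a semicolon, bracketed groups of dot-separated hexadecimal weights, then a # comment); the function name parse_allkeys_entry is hypothetical.

```python
import re


def parse_allkeys_entry(line):
    """Return (code_points, weight_groups) for one allkeys.txt-style line."""
    body = line.split("#", 1)[0]              # drop the trailing comment
    chars, _, elements = body.partition(";")  # code points ; weight groups
    code_points = [int(cp, 16) for cp in chars.split()]
    weight_groups = [
        [int(weight, 16) for weight in group.split(".")]
        for group in re.findall(r"\[[.*]([0-9A-Fa-f.]+)\]", elements)
    ]
    return code_points, weight_groups


SHARP_S = ("00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF]"
           "[.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S")

code_points, groups = parse_allkeys_entry(SHARP_S)
print(len(code_points), len(groups))  # 1 3
```

One code point produces three bracketed groups of weights, which is the one-to-many relation that the definition describes.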
IGNORABLE CHARACTER: A character whose collating element has no significance for comparison. One ignorable character has one collating element, but zero weights at all levels. For example (from allkeys.txt):
0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
This is ignorable for three levels but not four levels. Therefore it is an “ignorable character” when you produce a weight string for one, two, or three levels.
“Ignorable at level 1” means the level-1 weight is ignorable, as represented by 0000 in allkeys.txt.
“Fully ignorable” means ignorable for all levels.
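The dependence on the number of levels can be checked mechanically. The sketch below assumes that a weight written as 0000 in allkeys.txt is the marker for “ignorable at that level”, as stated above; the function name is_ignorable is hypothetical.

```python
def is_ignorable(weights, levels):
    """True if every weight for levels 1..levels is 0000."""
    return all(weight == 0 for weight in weights[:levels])


# Weights from the HEBREW ACCENT ETNAHTA entry, level 1 to level 4.
ETNAHTA = (0x0000, 0x0000, 0x0000, 0x0591)

for levels in (1, 2, 3, 4):
    print(levels, is_ignorable(ETNAHTA, levels))
# 1 True
# 2 True
# 3 True
# 4 False
```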
ISO 14651: The ISO/IEC 14651 “International String Ordering” standard.
LEVEL: A prioritization order for weights.
Each level has a name “level + number”, for example “level 1”, “level 2”, “level 3”, “level 4”. (Do not use, or rarely use, the equivalent terms “primary”, “secondary”, “tertiary”, “quaternary”.) Typically level 1 is the character-differs level, which is useful for WHERE clauses; levels 2 and following are the case-differs, accent-differs, or something-minor-differs levels, which might be useful for ORDER BY clauses. For example, from allkeys.txt 5.0.0:
0061 ; [.0FD0.0020.0002.0061] # LATIN SMALL LETTER A
24D0 ; [.0FD0.0020.0006.24D0] # CIRCLED LATIN SMALL LETTER A
0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
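A short sketch shows where these three entries start to differ. The weights are copied from the entries above; the dictionary keys and the function name first_differing_level are hypothetical.

```python
# One tuple per collating element, ordered level 1 to level 4.
ENTRIES = {
    "a": (0x0FD0, 0x0020, 0x0002, 0x0061),
    "circled a": (0x0FD0, 0x0020, 0x0006, 0x24D0),
    "A": (0x0FD0, 0x0020, 0x0008, 0x0041),
}


def first_differing_level(first, second):
    """Return the 1-based level where two weight tuples first differ, or None."""
    for level, (w1, w2) in enumerate(zip(first, second), start=1):
        if w1 != w2:
            return level
    return None


print(first_differing_level(ENTRIES["a"], ENTRIES["A"]))          # 3
print(first_differing_level(ENTRIES["a"], ENTRIES["circled a"]))  # 3
```

All three entries share the same level-1 and level-2 weights, so a comparison limited to two levels treats them as equal; they only separate at level 3.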
ORDERING KEY: Do not use. Use “weight string”.
SORTKEY: Do not use. Use “weight string”.
SUBKEY: A sequence of weights for a single level.
UCA: Unicode Collation Algorithm as described in Unicode Technical Standard #10, http://www.unicode.org/reports/tr10.
WEIGHT: A positive numeric value used for comparisons.
Weights come from collating tables and go to weight strings. Often a weight appears as a 4-digit hexadecimal number in collating tables. For example (from allkeys.txt):
0062 ; [.0FE6.0020.0002.0062] # LATIN SMALL LETTER B
WEIGHT STRING: A binary string, sometimes called a “sortkey” or an “ordering key”, produced by taking a series of weights from a collating table for a certain number of levels, ordering them by position and level, and outputting the result.
For example: starting with a character string ABC, and knowing that the number of levels is 2, look up the collating elements for A and B and C in allkeys.txt 5.0.0:
0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
0042 ; [.0FE6.0020.0008.0042] # LATIN CAPITAL LETTER B
0043 ; [.0FFE.0020.0008.0043] # LATIN CAPITAL LETTER C
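One possible way to turn these three entries into a weight string is sketched below. It rests on several assumptions that this glossary does not state: the weights are concatenated level by level, in character order within each level, with no separator between levels, and each weight is written as two big-endian bytes. A real implementation may well differ on any of these points. The function name weight_string is hypothetical.

```python
# Weights copied from the three allkeys.txt 5.0.0 entries above,
# ordered level 1 to level 4.
WEIGHTS = {
    "A": (0x0FD0, 0x0020, 0x0008, 0x0041),
    "B": (0x0FE6, 0x0020, 0x0008, 0x0042),
    "C": (0x0FFE, 0x0020, 0x0008, 0x0043),
}


def weight_string(text, levels, table=WEIGHTS):
    """Concatenate weights level by level (assumed layout, no separators)."""
    weights = []
    for level in range(levels):          # level 1 first, then level 2, ...
        for char in text:                # within a level, keep string order
            weights.append(table[char][level])
    # Assumption: each weight is stored as 2 bytes, big-endian.
    return b"".join(weight.to_bytes(2, "big") for weight in weights)


print(weight_string("ABC", 2).hex())  # 0fd00fe60ffe002000200020
```

Under these assumptions the level-1 weights 0FD0 0FE6 0FFE come first, followed by the level-2 weights 0020 0020 0020, so a bytewise comparison of two weight strings looks at level 1 before level 2.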
WEIGHTING ELEMENT: A sequence of weights, in ascending order by level.
For example, from allkeys.txt 5.0.0:
00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S
ZERO WEIGHTS: The meaning is “an empty sequence of weights” (the ISO 14651 definition), not “weights with value 0000” (the UCA definition).
For example (from allkeys.txt):
0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
