WL#5210: German collation

Affects: Server-Prototype Only — Status: Un-Assigned

Adopt German collations which follow standards
and are consistent across many character sets.
Start with new German collations for latin9.

WL#5170: Swedish collation

This task adds two collations for latin9,
based on the Unicode Collation Algorithm (UCA)
and the CLDR German tailoring.
They are like latin1_german_ci and latin1_german2_ci
but without the bugs, and closer to the standards.

Principles
----------

The new collations are based on the Unicode Collation
Algorithm (UCA), and are tailored according to the
Common Locale Data Repository (CLDR)
from the Unicode site. A fuller description of
the principles is in WL#5170 Swedish collation.

Names
-----

Since the convention is
  character set name _ language name _ UCA version _ case-insensitivity abbreviation
the new collations are
  latin9_german_520_ci
  latin9_german2_520_ci
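
As an illustration of the naming convention, here is a
minimal Python sketch that assembles the two names from
their parts. The function collation_name and its
parameters are hypothetical, made up for this example;
they are not part of the server.

```python
def collation_name(charset, language, uca_version, sensitivity="ci"):
    """Join the four parts of a collation name with underscores.

    Hypothetical helper for illustration only.
    """
    return "_".join([charset, language, uca_version, sensitivity])


# The two collations proposed in this task.
names = [collation_name("latin9", lang, "520") for lang in ("german", "german2")]
print(names)
```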

The Rules
---------

The tailoring rules come from the CLDR file de.xml,
which is attached to this worklog task. It can also
be obtained through these steps:
1. Go to http://cldr.unicode.org/
2. Click "CLDR Releases/Downloads"
3. Click "CLDR 1.7.2"
4. Click "core.zip"
5. Unzip core.zip
6. Copy ./common/collation/de.xml
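
Once core.zip has been downloaded by hand, the last two
steps can be mirrored by a small script. This is a minimal
Python sketch, assuming the archive keeps the layout named
in the steps (common/collation/de.xml). The function name
extract_de_xml is hypothetical.

```python
import zipfile

# Path of the German collation file inside core.zip,
# as given in the steps (assumed unchanged).
DE_XML_MEMBER = "common/collation/de.xml"


def extract_de_xml(zip_path, dest_dir):
    """Extract de.xml from a local core.zip and return the extracted path."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.extract(DE_XML_MEMBER, path=dest_dir)


# Example use, with core.zip in the current directory:
#   path = extract_de_xml("core.zip", ".")
```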

Remember that, according to "Principles",
any ligatures are sorted as equal to the
first character of the expansion, because
we want to keep the collations simple
(one weight per character, primary weights only).

So Peter Gulutzan thinks these are the rules:

For de.xml collation_type="standard" i.e. latin9_german_520_ci:

No special rules. The default UCA table (DUCET)
should already take care of
Ä = A
Ö = O
Ü = U
ß = S

This passage in the CLDR for collation_type="standard" is hard to understand:
"
ae
æ
Æ
"
For our limited purpose, it appears we must say Æ = A.
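
To make the "standard" equivalences concrete, here is a
minimal Python sketch of a one-weight-per-character sort
key. It is an illustration only: the real collation would
use UCA primary weights, whereas this sketch simply folds
the listed characters and compares everything else by
uppercase code point. The names STANDARD_EQUIV and
standard_key are hypothetical.

```python
# Equivalences listed above for collation_type="standard".
STANDARD_EQUIV = {"Ä": "A", "Ö": "O", "Ü": "U", "ß": "S", "Æ": "A"}


def standard_key(text):
    """Return a case-insensitive key with one output character per input character."""
    key = []
    for ch in text:
        # Python's upper() would expand 'ß' to 'SS', so leave it alone
        # and let the table map it to a single 'S'.
        if ch != "ß":
            ch = ch.upper()
        key.append(STANDARD_EQUIV.get(ch, ch))
    return "".join(key)


words = ["Ofen", "Öl", "Oma"]
print(sorted(words, key=standard_key))
```

With this key, "Öl" sorts as if it were "OL", so it falls
between "Ofen" and "Oma".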

For de.xml collation_type="phonebook" i.e. latin9_german2_520_ci:
Æ = AE
Ä = AE
Œ = OE
Ö = OE
Ü = UE
ß = SS
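
The "phonebook" rules differ in that one character can
expand to two. A minimal Python sketch of such a key
follows, under the same caveat as any toy example: it
stands in for real UCA weights by comparing uppercase
letters, and the names PHONEBOOK_EXPANSIONS and
phonebook_key are hypothetical.

```python
# Expansions listed above for collation_type="phonebook".
PHONEBOOK_EXPANSIONS = {
    "Æ": "AE", "Ä": "AE",
    "Œ": "OE", "Ö": "OE",
    "Ü": "UE",
    "ß": "SS",
}


def phonebook_key(text):
    """Return a case-insensitive key in which the listed characters expand."""
    key = []
    for ch in text:
        # 'ß' is handled by the table; everything else is uppercased first.
        if ch != "ß":
            ch = ch.upper()
        key.append(PHONEBOOK_EXPANSIONS.get(ch, ch))
    return "".join(key)


names = ["Muller", "Müller", "Mufti"]
print(sorted(names, key=phonebook_key))
```

Here "Müller" is keyed as "MUELLER", so it compares equal
to "Mueller" and sorts before "Mufti".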

Note that latin9_german2_520_ci is therefore not a
'simplified' collation.
That seems reasonable, since 'german2' is meaningless unless
we allow expansions of Ä Ö Ü ß, as we do in latin1_german2_ci.
But then what about the other possible expansions?
HORIZONTAL ELLIPSIS
VULGAR FRACTION ONE QUARTER
VULGAR FRACTION ONE HALF
VULGAR FRACTION THREE QUARTERS
TRADE MARK SIGN
Until we decide what the "principle" is for this case,
we'll treat those characters as we do for other latin9 collations.
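
One way to see why these characters are candidates for
expansion is that each has a Unicode compatibility
decomposition into several characters. The Python sketch
below uses the standard unicodedata module, with NFKD
normalization as a stand-in for the DUCET expansions.
That substitution is an assumption made for illustration,
not a statement of how the collation would be implemented.
The function name compatibility_expansion is hypothetical.

```python
import unicodedata

CANDIDATES = [
    "HORIZONTAL ELLIPSIS",
    "VULGAR FRACTION ONE QUARTER",
    "VULGAR FRACTION ONE HALF",
    "VULGAR FRACTION THREE QUARTERS",
    "TRADE MARK SIGN",
]


def compatibility_expansion(name):
    """Look a character up by its Unicode name and return its NFKD form."""
    ch = unicodedata.lookup(name)
    return unicodedata.normalize("NFKD", ch)


for name in CANDIDATES:
    expansion = compatibility_expansion(name)
    print(name, [f"U+{ord(c):04X}" for c in expansion])
```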

Other German tasks are irrelevant
---------------------------------

We know that we are not moving any nearer to
WL#1287 REAL DIN-1 German collation, or
WL#4013 Unicode german2 collation.
Try to remember that the objective is to have a
simplified one-weight-per-character collation
which doesn't have the bugs of the old latin1
collations, and follows UCA + CLDR for primary
weighting when that doesn't conflict with the
simplification.

The complete character list
---------------------------

See section "The complete character list" in
WL#5170 Swedish collation. The only different
weights are for the characters Æ Ä Œ Ö Ü ß,
as described in section "The Rules" above,
and we won't use any of the Swedish tailoring.

Some Problems
-------------

Before we can accept this task, we need to agree on
these points:
* The "Principles" agreed for WL#5170 are okay generally.
* Expansions do / don't apply for HORIZONTAL ELLIPSIS etc.
* The section "The Rules" above correctly reflects CLDR de.xml.
* The CLDR de.xml does not contain errors.