WL#5210: German collation
Affects: Server-Prototype Only — Status: Un-Assigned — Priority: Very High
Adopt German collations which follow standards and are consistent across many character sets. Start with new German collations for latin9.
Two collations for latin9, based on the Unicode Collation Algorithm (UCA) and the CLDR German tailoring.
Like latin1_german_ci and latin1_german2_ci, but without the bugs and closer to the standards.

Principles
----------

New collations are based on the Unicode Collation Algorithm (UCA), and are
tailored according to the Common Locale Data Repository (CLDR) from the Unicode site.
A fuller description of the principles is in WL#5170 Swedish collation.

Names
-----

Since the convention is
character set name _ language name _ UCA version _ case-insensitivity abbreviation
the new collations are
latin9_german_520_ci
latin9_german2_520_ci

The Rules
---------

The tailoring rules come from the CLDR file de.xml, which is attached to this
worklog task and can also be obtained through these steps:
Go to http://cldr.unicode.org/
Click "CLDR Releases/Downloads"
Click "CLDR 1.7.2"
Click "core.zip"
Unzip core.zip
Copy ./common/collation/de.xml

Remember that, according to "Principles", any ligatures are sorted as
equal to the first character of the expansion, because we want to keep
the collations simple (one weight per character, primary weights only).

So Peter Gulutzan thinks these are the rules:

For de.xml collation_type="standard" i.e. latin9_german_520_ci:
No special rules. The UCA default table (DUCET) should take care of
Ä = A
Ö = O
Ü = U
ß = S
This passage in the CLDR for collation_type="standard"
is hard to understand:
"
<reset>ae</reset>
<s>æ</s>
<t>Æ</t>
"
For our limited purpose, it appears we must say Æ = A.

For de.xml collation_type="phonebook" i.e. latin9_german2_520_ci:
Æ = AE
Ä = AE
Œ = OE
Ö = OE
Ü = UE
ß = SS
Oops, latin9_german2_520_ci is not a 'simplified' collation.
That seems reasonable, since 'german2' is meaningless unless
we allow expansions of Ä Ö Ü ß, as we do in latin1_german2_ci.
But then what about the other possible expansions,
HORIZONTAL ELLIPSIS
VULGAR FRACTION ONE QUARTER
VULGAR FRACTION ONE HALF
VULGAR FRACTION THREE QUARTERS
TRADE MARK SIGN
Until we decide what the "principle" is for this case,
we'll treat those characters as we do for other latin9 collations.

Other German tasks are irrelevant
---------------------------------

We know that we are not moving any nearer to
WL#1287 REAL DIN-1 German collation, or
WL#4013 Unicode german2 collation.
Try to remember that the objective is to have a simplified
one-weight-per-character collation which doesn't have the
bugs of the old latin1 collations, and follows UCA + CLDR
for primary weighting when that doesn't conflict with
the simplification.

The complete character list
---------------------------

See section "The complete character list" in
WL#5170 Swedish collation.
The only different weights are for the characters
Æ Ä Œ Ö Ü ß
as described in section "The Rules" above, and we won't
use any of the Swedish tailoring.

Some Problems
-------------

Before we can accept this task, we need to agree:
* The "Principles" agreed for WL#5170 are okay generally.
* Expansions do / don't apply for HORIZONTAL ELLIPSIS etc.
* The "Rules" section above correctly reflects CLDR de.xml
* The CLDR de.xml does not contain errors.
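
To make the difference between the two rule sets in "The Rules" concrete, here is a minimal Python sketch. It is only an illustration, not the server implementation. The names GERMAN_STANDARD, GERMAN_PHONEBOOK and primary_key are hypothetical. Uppercasing stands in for case-insensitivity, and plain string comparison of the mapped text stands in for real UCA primary weights. Both are simplifying assumptions.

```python
# Hypothetical illustration of the rules in "The Rules"; not MySQL code.
# Assumption: uppercase letters stand in for primary weights, so comparing
# the mapped strings approximates a primary-weight-only, case-insensitive sort.

# collation_type="standard" (latin9_german_520_ci): one weight per character.
GERMAN_STANDARD = {"Ä": "A", "Ö": "O", "Ü": "U", "ß": "S", "Æ": "A"}

# collation_type="phonebook" (latin9_german2_520_ci): expansions allowed.
GERMAN_PHONEBOOK = {
    "Æ": "AE", "Ä": "AE", "Œ": "OE", "Ö": "OE", "Ü": "UE", "ß": "SS",
}


def primary_key(text, rules):
    """Return a comparison key for text under the given rule table."""
    parts = []
    for ch in text:
        # Try the character itself first (this catches the lowercase-only ß),
        # then its uppercase form, and fall back to plain uppercasing.
        replacement = rules.get(ch)
        if replacement is None:
            replacement = rules.get(ch.upper(), ch.upper())
        parts.append(replacement)
    return "".join(parts)


if __name__ == "__main__":
    names = ["Muller", "Mueller", "Müller", "Mufti"]
    print(sorted(names, key=lambda s: primary_key(s, GERMAN_STANDARD)))
    print(sorted(names, key=lambda s: primary_key(s, GERMAN_PHONEBOOK)))
```

Under the standard rules "Müller" compares equal to "Muller". Under the phonebook rules it compares equal to "Mueller", which is why that collation cannot stay one-weight-per-character.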