WL#4013: Unicode german2 collation

Affects: Server-5.6   —   Status: Complete   —   Priority: Very High

Add collations for Unicode character sets with rules
for 'german2', sometimes known as 'German phone book'.
Add collations for Unicode character sets where
the rules are as in unicode_ci collations
with tailoring for 'german2', sometimes known as
'German phone book'.

This affects six characters, ä ö ü æ œ ß.
The tailoring rules nearly reflect the CLDR
(Common Locale Data Repository) and the DIN-2
(also known as phone-book-order) German collation.
The standard DIN-1 (also known as dictionary-order)
collation is already okay because it's utf8_unicode_ci.

The new collations will be:

Not WL#2673

WL#2673 "Unicode Collation Algorithm new version"
will also occur. However, the WL#2673 "Principles"
apply for newer collations. For utf8_german2_ci etc.
we are only trying to complete the set of older
collations which are based on Unicode 3.0. So we
do not need to follow the Principles, naming
conventions, or new rules for contractions and for
case folding as in WL#2673. 

There will be a separate set of German collations,
utf8_german2_520_ci etc., which will follow WL#2673
and therefore will not be the same as utf8_german2_ci etc.


UCA version: 3.0.
Exact correspondence with UCA rules is not required,
since the requirement is for correspondence with older
MySQL unicode_ci collations.


CLDR version: 1.8.0 (2010-03-17).
Download from: http://cldr.unicode.org/index/downloads (use core.zip)
CLDR file: de.xml. See de.xml file attached to this task.
CLDR variant collation type in de.xml file, if any: "phonebook".

Other implementations

Alexander Barkov maintains a site showing
how other DBMS vendors or OS vendors collate.
The Oracle10g German collation is:
The Microsoft SQL Server German collation is:

The MySQL 5.1 collation latin1_german2_ci collation also exists.
The utf8 german2 collations will differ from latin1_german2_ci
for æ and œ.

The Rules

The requirement is that utf8_german2_ci etc. will cause
returning exactly the same results as utf8_unicode_ci
etc., except for characters described in this section.

These are all the de.xml rules, expressed as Unicode
names with the symbol "=" meaning "primary-level equivalent".

The columns "DIN-2", "Microsoft", "Oracle", and "MySQL"
contain "yes" when there is agreement with de.xml,
"no" when there is disagreement.
The rule for Sharp S is not in de.xml but is basic UCA.
The capital letter Sharp S is not changed.

                                           DIN2 Microsoft Oracle MySQL
                                           ---- --------- ------ -----
Ä/ä 00C4/00E4 LETTER A WITH DIAERESIS = AE yes  yes       yes    yes 
Ö/ö 00D6/00F6 LETTER O WITH DIAERESIS = OE yes  yes       yes    yes
Ü/ü 00DC/00FC LETTER U WITH DIAERESIS = UE yes  yes       yes    yes
Æ/æ 00C6/00E6 LETTER AE = AE               yes  yes       no     no
Œ/œ 0152/0153 LIGATURE OE = OE             yes  yes       no     no
 /ß     /00DF LETTER SHARP S = SS          yes  yes       yes    yes

Tailoring, derived from de.xml, including secondary and tertiary difference,
looks like this:

&AE << æ <<< Æ << ä <<< Ä
&OE << œ <<< Œ << ö <<< Ö
&UE << ü <<< Ü

or, in terms of character codes:

&AE << \\u00E6 <<< \\u00C6 << \\u00E4 <<< \\u00C4
&OE << \\u0153 <<< \\u0152 << \\u00F6 <<< \\u00D6
&UE << \\u00FC <<< \\u00DC

Since most of these rules already apply for MySQL latin1_german2_ci,
it is possible to convert utf8 to latin1 and then sort.
For example:
"order by cast(utf8_col as char(2) character set latin1)
  collate latin1_german2_ci"
But a built-in collation is more useful.


WL#2673 Unicode Collation Algorithm new version
WL#5170 Swedish collation
dev-private email thread "Sorting German2"
dev-private email thread "WL#4013 Unicode german2 collation"


- "How to create new collations?"
- Feature request at BUG#38758 "utf8_german2_ci"