Add collations for Unicode character sets with rules for 'german2', sometimes known as 'German phone book'.
Add collations for Unicode character sets where the rules are as in unicode_ci collations with tailoring for 'german2', sometimes known as 'German phone book'. This affects six characters, ä ö ü æ œ ß. The tailoring rules nearly reflect the CLDR (Common Locale Data Repository) and the DIN-2 (also known as phone-book-order) German collation. The standard DIN-1 (also known as dictionary-order) collation is already okay because it's utf8_unicode_ci. The new collations will be: utf8_german2_ci ucs2_german2_ci utf8mb4_german2_ci utf16_german2_ci utf32_german2_ci Not WL#2673 ----------- WL#2673 "Unicode Collation Algorithm new version" will also occur. However, the WL#2673 "Principles" apply for newer collations. For utf8_german2_ci etc. we are only trying to complete the set of older collations which are based on Unicode 3.0. So we do not need to follow the Principles, naming conventions, or new rules for contractions and for case folding as in WL#2673. There will be a separate set of German collations, utf8_german2_520_ci etc., which will follow WL#2673 and therefore will not be the same as utf8_german2_ci etc. UCA --- UCA version: 3.0. Exact correspondence with UCA rules is not required, since the requirement is for correspondence with older MySQL unicode_ci collations. CLDR ---- CLDR version: 1.8.0 (2010-03-17). Download from: http://cldr.unicode.org/index/downloads (use core.zip) CLDR file: de.xml. See de.xml file attached to this task. CLDR variant collation type in de.xml file, if any: "phonebook". Other implementations --------------------- Alexander Barkov maintains a site showing how other DBMS vendors or OS vendors collate. The Oracle10g German collation is: http://www.collation-charts.org/oracle10g/ora10g.WE8MSWIN1252.XGERMAN_DIN.html The Microsoft SQL Server German collation is: http://www.collation-charts.org/mssql/mssql.10407.1252.German_PhoneBook_CI_AI.html The MySQL 5.1 collation latin1_german2_ci collation also exists. The utf8 german2 collations will differ from latin1_german2_ci for æ and œ. The Rules --------- The requirement is that utf8_german2_ci etc. will cause returning exactly the same results as utf8_unicode_ci etc., except for characters described in this section. These are all the de.xml rules, expressed as Unicode names with the symbol "=" meaning "primary-level equivalent". The columns "DIN-2", "Microsoft", "Oracle", and "MySQL" contain "yes" when there is agreement with de.xml, "no" when there is disagreement. The rule for Sharp S is not in de.xml but is basic UCA. The capital letter Sharp S is not changed. DIN2 Microsoft Oracle MySQL ---- --------- ------ ----- Ä/ä 00C4/00E4 LETTER A WITH DIAERESIS = AE yes yes yes yes Ö/ö 00D6/00F6 LETTER O WITH DIAERESIS = OE yes yes yes yes Ü/ü 00DC/00FC LETTER U WITH DIAERESIS = UE yes yes yes yes Æ/æ 00C6/00E6 LETTER AE = AE yes yes no no Œ/œ 0152/0153 LIGATURE OE = OE yes yes no no /ß /00DF LETTER SHARP S = SS yes yes yes yes Tailoring, derived from de.xml, including secondary and tertiary difference, looks like this: &AE << æ <<< Æ << ä <<< Ä &OE << œ <<< Œ << ö <<< Ö &UE << ü <<< Ü or, in terms of character codes: &AE << \\u00E6 <<< \\u00C6 << \\u00E4 <<< \\u00C4 &OE << \\u0153 <<< \\u0152 << \\u00F6 <<< \\u00D6 &UE << \\u00FC <<< \\u00DC Since most of these rules already apply for MySQL latin1_german2_ci, it is possible to convert utf8 to latin1 and then sort. For example: "order by cast(utf8_col as char(2) character set latin1) collate latin1_german2_ci" But a built-in collation is more useful. References ---------- WL#2673 Unicode Collation Algorithm new version WL#5170 Swedish collation dev-private email thread "Sorting German2" dev-private email thread "WL#4013 Unicode german2 collation" References ---------- - "How to create new collations?" http://forums.mysql.com/read.php?103,52871,52871 - Feature request at BUG#38758 "utf8_german2_ci"