WL#5170: Swedish collation

Affects: Server-Prototype Only — Status: Un-Assigned

Description
High Level Architecture

Adopt a Swedish collation which follows standards
and is consistent across many character sets.
Start with a new Swedish collation for latin1.

The main recommendation for all new MySQL
collations is: follow the Default Unicode
Collation Element table (DUCET). This worklog
task description is only about the departures
from DUCET which are necessary for simplicity
and Swedishness, in other words "tailoring".

Why not fix latin1_swedish_ci?
------------------------------

We already have "Swedish" collations, but they
have imperfections. In particular, latin1_swedish_ci is
"wrong" (i.e. non-standard/unexpected) not only for the
backslash character mentioned in BUG#46659 but also for
real alphabetic characters (oe ligature, z caron, sharp s,
y diaeresis, thorn, s caron, o with stroke) and for many
punctuation or special characters. See the chart at
http://www.collation-charts.org/mysql60/mysql604.latin1_swedish_ci.html

However, we cannot fix an existing collation without
affecting indexes drastically. We tried to do that
with a small change in utf8_general_ci for SHARP S.
The results were terrible. We won't do that again.
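
To illustrate why, here is a minimal Python sketch (not MySQL
code). The weights are made up, and the change shown (SHARP S
becoming equal to 's') is only loosely modelled on the
utf8_general_ci case. The one assumption is that an index keeps
its entries in the order given by the collation in force when
the index was built.

```python
# Hypothetical primary weights; only their relative order matters.
OLD = {"a": 1, "b": 2, "c": 3, "s": 4, "ß": 5, "t": 6}
NEW = {**OLD, "ß": 4}  # the "small change": SHARP S now equals 's'


def sort_key(word, weights):
    """Return the list of weights for each character of word."""
    return [weights[ch] for ch in word]


def is_sorted(entries, weights):
    """True if entries are in non-descending order under weights."""
    keys = [sort_key(w, weights) for w in entries]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# An "index" built while the old weights were in force.
index = sorted(["ta", "ßb", "sc", "sa"], key=lambda w: sort_key(w, OLD))

print(index)                  # order stored in the index
print(is_sorted(index, OLD))  # consistent with the old collation
print(is_sorted(index, NEW))  # no longer consistent after the change
```

Any lookup that relies on the stored order (for example a
binary search) can miss rows until the index is rebuilt.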

So we have to have one or more new Swedish collations.
The old collations will remain; they will not be
deprecated, and latin1_swedish_ci will continue to be
the default for the latin1 character set.

Name
----

WL#2673 "Unicode Collation Algorithm new version"
proposes names like utf8_swedish_500_ci. Since
the Unicode version will probably be 5.2 not 5.0,
a name like utf8_swedish_520_ci is likely. So a
latin1 "equivalent" would be latin1_swedish_520_ci.

We once had a complaint from somebody who thought
utf8_general_ci must have the same behaviour as
latin1_general_ci, since they're both named 'general'.
Somebody might apply the same pseudo-logic to our
Swedish collation names.

Simple Collation
----------------

We will follow current rules described in
WL#2673 Unicode Collation Algorithm new version,
section "Simple collations". That section
includes the rules for expansions and ignorables.

CLDR
----

The CLDR (Common Locale Data Repository) is
available on the Unicode site for download.

CLDR is clearly based on official standards,
CLDR seems to be more up to date than e.g. Posix,
CLDR is the basis for major products like ICU,
CLDR is easy to acquire and read.
So in general MySQL should take CLDR seriously.

The particular document we're taking from the CLDR
repository is collation/sv.xml. The "sv" means "Swedish".
It describes two types, "standard" (where w=v) and
"reformed" (where w<>v). We care about "reformed".

Quotation from the 1982 Swedish standard
----------------------------------------

4.Swed.1982. Svensk Standard 03 81 04:
Dokumentation – Administrativ filering – Alfanumerisk sortering
(Documentation – Administrative filing rules – Alphanumerical ordering).
1. ed. (1982-06-25), 4.2.4:
“Specialbokstäver i språk skrivna med det latinska alfabetet
konverteras vid filering till en eller flera av de latinska
bokstäverna enligt följande tabell:
isländskt ð, Ð = d ... polskt ł, Ł = l ... serbokroatiskt đ, Đ = d
... samiskt ŋ = n ... turkiskt ı = i ... isländskt þ = th
... grönländskt ĸ = k ... tyskt ü = y ... danskt, norskt ø = ö
... ungerskt ő = y ... ungerskt ű = y”
quoted in http://www.evertype.com/standards/wynnyogh/thorn.html

Translating the tabular part alone, we have:
ETH "Icelandic ð, Ð" = d
D WITH STROKE "Serbocroatian đ, Đ"  = d
DOTLESS I "Turkish ı" = i
KRA "Greenlandic ĸ" U0138 = k
O WITH STROKE "Danish, Norwegian ø" = ö
L WITH STROKE "Polish ł, Ł" = l
ENG "Sami ŋ" U014B, U014A = n
THORN "Icelandic þ" = th
U WITH DIAERESIS "German ü" = y
O WITH DOUBLE ACUTE "Hungarian ő" = y
U WITH DOUBLE ACUTE "Hungarian ű" = y

The suggestions for DOTLESS I and KRA and ENG
and O WITH DOUBLE ACUTE are different from sv.xml.
The suggestion for O WITH DOUBLE ACUTE may be
a typo. We don't see all the rules here, for
example the recent change concerning w <> v.
We should get a full copy of the current
standard from sis.se (about 60 euros).

Also could somebody please check these references re V <> W:
http://www.svenskaakademien.se/web/Svenska_Akademiens_ordlista.aspx
http://www.dn.se/dnbok/alfabetet-blir-langre-vaxer-med-w-1.666436

Other implementations
---------------------

Alexander Barkov maintains a site showing
how other DBMS vendors or OS vendors collate.
The Oracle10g Swedish collation is:
http://www.collation-charts.org/oracle10g/ora10g.WE8MSWIN1252.SWEDISH.html
The Microsoft Vista Swedish collation is:
http://www.collation-charts.org/vista/vista.041D.CP1252.Swedish_Sweden.html

The Rules
---------

These are all the sv.xml rules, expressed as Unicode
names with the symbol "=" meaning "primary-level equivalent".
The columns "Swed 1982", "Microsoft" and "Oracle"
contain "yes" when there is agreement with sv.xml,
"no" when there is disagreement, and "-" when there
is no comparison (because we don't have the complete
standard, and because the Microsoft/Oracle collations
are only for an 8-bit character set).


                                Swed 1982   Microsoft Oracle
                                ---------   --------- ------

A RING BEFORE EZH               -           yes       yes
A DIAERESIS AFTER A RING        -           yes       yes
O DIAERESIS AFTER A DIAERESIS   -           yes       yes
ETH = D                         yes         yes       yes
D STROKE = D                    yes         -         -
THORN = TH                      yes         yes (?)   no, = T
O STROKE = O DIAERESIS          yes         yes       yes
AE = A DIAERESIS                -           no        yes
O DOUBLE ACUTE = O DIAERESIS    no (typo?)  -         -
U DIAERESIS = Y                 yes         yes       yes
U DOUBLE ACUTE = Y              yes         -         -
L STROKE = L                    yes         -         -
A DIAERESIS = E OGONEK          no          -         -
OE = O DIAERESIS                no          no        no
O CIRCUMFLEX = O DIAERESIS      no          no        no


The first three rules -- A RING BEFORE EZH, A DIAERESIS AFTER A RING,
O DIAERESIS AFTER A DIAERESIS -- are basic; this is what every Swede
would agree to without question. EZH is the first real letter after
all the variants of Z in the DUCET, so "before EZH" is just a formal
way to say "after Z, and after Z caron, and after Visigothic Z, etc.".

ETH = D is also generally agreed, indeed it's in the DUCET now.
D WITH STROKE = D is probably here because D WITH STROKE looks
very much like ETH, not because it's Swedish.

THORN = TH is an expansion, so we cannot follow this rule with a
simple collation (with a simple collation we take the first expanded
letter which is 'T'). There does appear to be agreement that THORN
should be with 'T', or a separate letter between 'T' and 'U', or = 'TH'.
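
The difference between the full rule and the simple-collation
fallback can be sketched in Python as below. This is only an
illustration: the function names are hypothetical, and plain
lower-cased strings stand in for real collation weights.

```python
# The one expansion discussed here: THORN expands to "th".
EXPANSIONS = {"þ": "th"}


def full_key(word):
    """Key with the whole expansion applied (THORN = TH)."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word.lower())


def simple_key(word):
    """Key for a simple collation: one weight per character,
    so only the first letter of an expansion survives (THORN = T)."""
    return "".join(EXPANSIONS.get(ch, ch)[0] for ch in word.lower())


words = ["ti", "þa", "tg"]
print(sorted(words, key=full_key))    # THORN sorts like "th"
print(sorted(words, key=simple_key))  # THORN sorts like "t"
```

With the full rule THORN lands between "tg" and "ti"; with the
simple collation it is indistinguishable from 'T' at the primary
level.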

O STROKE = O DIAERESIS treats a Danish/Norwegian letter, everyone agrees.

AE = A DIAERESIS treats a Danish/Norwegian letter, and one could
have expected that everyone would agree about this one too. It is
a mystery that Microsoft treats AE in the non-Scandinavian way.

O DOUBLE ACUTE = O DIAERESIS may be taking into account that,
in Swedish handwriting, ö may look like ő, according to
http://en.wikipedia.org/wiki/Double_acute_accent

U DIAERESIS = Y treats a German letter, everyone agrees.

U DOUBLE ACUTE = Y is consistent with the "Swedish handwriting"
considerations described for O DOUBLE ACUTE, and the U DIAERESIS
rule.

L STROKE = L is just basic DUCET, nowadays. It's shown separately
because the Swedish standard document mentioned it, and it wasn't
in the older versions of the DUCET.

A DIAERESIS = E OGONEK will confuse you if you only look at the
official Unicode name for the character, because ogoneks are Polish.
In reality this concerns E CAUDATA http://en.wikipedia.org/wiki/E_caudata
which looks the same as E OGONEK but has a different heritage,
from Old Norse.

OE = O DIAERESIS might be controversial because most people,
probably including most Swedes nowadays, think of the OE ligature
as something French. But actually this is the letter "ethel", and
again we can turn to Wikipedia for explanation:
"Œ is used in the modern scholarly orthography of Old West Norse,
representing the long vowel /øː/, contrasting with ø, which represents
the short vowel /ø/." http://en.wikipedia.org/wiki/%C5%92
Well, given that ø i.e. O STROKE is accepted to be equal to O DIAERESIS,
this makes sense after all.

O CIRCUMFLEX = O DIAERESIS defies easy explanation.
This time Wikipedia has only a vague mention that might justify:
"In Swedish, when transcribing dialectal speech, the circumflex is
often used to denote an a or o which is pronounced dialectally as if
it has been written ä [æ] or ö [ø]." http://en.wikipedia.org/wiki/Circumflex
Okay, but if that's the explanation, then it's odd that we don't
see the same thing for A CIRCUMFLEX.

Most of the above rules, including the one for O CIRCUMFLEX, are also in
sv_SE.UTF-8.src from posix.zip. But not all of them are in the "real"
Posix list http://www.collation-charts.org/fc6/fc6.sv_SE.iso88591.html.
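
As a rough illustration of how the rules in this section combine,
here is a Python sketch of a primary-level sort key. It is not
the proposed implementation. The function name primary_key and
the small integer ranks are hypothetical, other accented letters
are assumed to fall back to their base letter, and any character
outside a-z is simply ignored.

```python
import unicodedata

# Letters that sort after 'z' (ranks 0..25 are 'a'..'z').
AFTER_Z = {"å": 26, "ä": 27, "ö": 28}

# Primary-level equivalences taken from the rules table.
EQUIV = {
    "ð": "d", "đ": "d", "ł": "l",
    "þ": "th",                       # expansion
    "ü": "y", "ű": "y",
    "æ": "ä", "ę": "ä",
    "ø": "ö", "ő": "ö", "œ": "ö", "ô": "ö",
}


def primary_key(text):
    """Return a list of primary ranks for text (case-insensitive)."""
    key = []
    for ch in unicodedata.normalize("NFC", text).lower():
        for c in EQUIV.get(ch, ch):
            if c in AFTER_Z:
                key.append(AFTER_Z[c])
                continue
            # Assumption: remaining accents are dropped (e.g. é -> e).
            base = unicodedata.normalize("NFD", c)[0]
            if "a" <= base <= "z":
                key.append(ord(base) - ord("a"))
            # Everything else is treated as ignorable in this sketch.
    return key


words = ["öl", "zebra", "ål", "ärt", "apa"]
print(sorted(words, key=primary_key))
```

In this sketch 'v' and 'w' keep separate ranks, as in the
"reformed" type of sv.xml.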

Existing collations
-------------------

Existing Swedish collations like latin1_swedish_ci
will continue to be supported indefinitely. In fact
latin1_swedish_ci will continue to be the default
collation for latin1.

BUG#36144 "Add latin1_swedish_cs collation" won't happen.

Non-Swedish
-----------

This will not become the collation for Finnish. Although
in the past it was common to have a Swedish/Finnish
combined collation, the CLDR fi.xml differs from
the CLDR sv.xml in significant ways.

The new rules for OE and O CIRCUMFLEX are inappropriate
for French. This new collation, unlike latin1_swedish_ci,
will not be recommended for general use outside Sweden.

The complete character list
---------------------------

The rest of this document is latin1_swedish_5xx_ci detail.

Here is the character order, showing UCA 5.2 primary weights
(for a simple collation we have to replace them with 1-byte weights),
with '*' meaning optional ignorable, and
with '!' meaning there is a difference from UCA 5.2, e.g. Swedish tailoring.

latin1 ucs2   weight    name

00     0000 ! 0000      (control)
01     0001 ! 0001      (control)
02     0002 ! 0002      (control)
03     0003 ! 0003      (control)
04     0004 ! 0004      (control)
05     0005 ! 0005      (control)
06     0006 ! 0006      (control)
07     0007 ! 0007      (control)
08     0008 ! 0008      (control)
0e     000e !*000e      (control)
0f     000f !*000f      (control)
10     0010 ! 0010      (control)
11     0011 ! 0011      (control)
12     0012 ! 0012      (control)
13     0013 ! 0013      (control)
14     0014 ! 0014      (control)
15     0015 ! 0015      (control)
16     0016 ! 0016      (control)
17     0017 ! 0017      (control)
18     0018 ! 0018      (control)
19     0019 !*0019      (control)
1a     001a !*001a      (control)
1b     001b !*001b      (control)
1c     001c !*001c      (control)
1d     001d !*001d      (control)
1e     001e !*001e      (control)
1f     001f !*001f      (control)
7f     007F !*0020      DELETE
81     0081 !*0021      (control)
8d     008d ! 0022      PARTIAL LINE FEED
8f     008f ! 0023      SINGLE SHIFT THREE
90     0090 ! 0024      DEVICE CONTROL STRING
9d     009d ! 0025      OPERATING SYSTEM COMMAND
09     0009 !*0201      (control) HORIZONTAL TABULATION
0a     000a !*0202      (control) LINE FEED
0b     000b !*0203      (control) VERTICAL TABULATION
0c     000c !*0204      (control) FORM FEED
0d     000d !*0205      (control) CARRIAGE RETURN
20     0020  *020a      SPACE
a0     00a0  *020A      NO-BREAK SPACE
60     0060  *020E      GRAVE ACCENT
b4     00B4  *020F      ACUTE ACCENT
98     02dc  *0210      SMALL TILDE
5e     005E  *0211      CIRCUMFLEX ACCENT
af     00af  *0212      MACRON
a8     00a8  *0216      DIAERESIS
b8     00B8  *021B      CEDILLA
5f     005F  *021D      LOW LINE
ad     00ad  *0222      SOFT HYPHEN
2d     002D  *0223      HYPHEN-MINUS
96     2013  *022B      EN DASH
97     2014  *022C      EM DASH
2c     002C  *0234      COMMA
3b     003B  *0243      SEMICOLON
3a     003A  *0247      COLON
21     0021  *026E      EXCLAMATION MARK
a1     00a1  *026F      INVERTED EXCLAMATION MARK
3f     003F  *0273      QUESTION MARK
bf     00BF  *0274      INVERTED QUESTION MARK
2e     002E  *0281      FULL STOP
85     2026 !*0281+1    HORIZONTAL ELLIPSIS [expansion]
b7     00B7  *0292      MIDDLE DOT
27     0027  *02EE      APOSTROPHE
91     2018  *02EF      LEFT SINGLE QUOTATION MARK
92     2019  *02F0      RIGHT SINGLE QUOTATION MARK
82     201a  *02F1      SINGLE LOW-9 QUOTATION MARK
8b     2039  *02F3      SINGLE LEFT-POINTING ANGLE QUOTATION MARK
9b     203a  *02F4      SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
22     0022  *02F5      QUOTATION MARK
93     201C  *02F6      LEFT DOUBLE QUOTATION MARK
94     201d  *02F7      RIGHT DOUBLE QUOTATION MARK
84     201e  *02F8      DOUBLE LOW-9 QUOTATION MARK
ab     00ab  *02FD      LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
bb     00BB  *02FE      RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
28     0028  *02FF      LEFT PARENTHESIS
29     0029  *0300      RIGHT PARENTHESIS
5b     005B  *0301      LEFT SQUARE BRACKET
5d     005D  *0302      RIGHT SQUARE BRACKET
7b     007B  *0303      LEFT CURLY BRACKET
7d     007D  *0304      RIGHT CURLY BRACKET
a7     00a7  *0351      SECTION SIGN
b6     00B6  *0352      PILCROW SIGN
a9     00a9  *0354      COPYRIGHT SIGN
ae     00ae  *0355      REGISTERED SIGN
40     0040  *0356      COMMERCIAL AT
2a     002A  *0357      ASTERISK
2f     002F  *035C      SOLIDUS
5c     005C  *035E      REVERSE SOLIDUS
26     0026  *035F      AMPERSAND
23     0023  *0362      NUMBER SIGN
25     0025  *0363      PERCENT SIGN
89     2030  *0365      PER MILLE SIGN
86     2020  *036A      DAGGER
87     2021  *036B      DOUBLE DAGGER
95     2022  *036C      BULLET
88     02c6  *03F0      MODIFIER LETTER CIRCUMFLEX ACCENT
b0     00B0  *044B      DEGREE SIGN
2b     002B  *0550      PLUS SIGN
b1     00B1  *0551      PLUS-MINUS SIGN
f7     00f7  *0552      DIVISION SIGN
d7     00d7  *0553      MULTIPLICATION SIGN
3c     003C  *0554      LESS-THAN SIGN
3d     003D  *0555      EQUALS SIGN
3e     003E  *0556      GREATER-THAN SIGN
ac     00ac  *0557      NOT SIGN
7c     007C  *0558      VERTICAL LINE
a6     00a6  *0559      BROKEN BAR
7e     007E  *055B      TILDE
a4     00a4   11DF      CURRENCY SIGN
a2     00a2   11E0      CENT SIGN
24     0024   11E1      DOLLAR SIGN
a3     00a3   11E2      POUND SIGN
a5     00a5   11E3      YEN SIGN
80     20AC   11F8      EURO SIGN
30     0030   1205      DIGIT ZERO
31     0031   1206      DIGIT ONE
b9     00B9   1206      SUPERSCRIPT ONE
bc     00BC ! 1206+1/2  VULGAR FRACTION ONE QUARTER [expansion]
bd     00BD ! 1206+1/2  VULGAR FRACTION ONE HALF [expansion]
32     0032   1207      DIGIT TWO
b2     00B2   1207      SUPERSCRIPT TWO
b3     00B3   1208      SUPERSCRIPT THREE
33     0033   1208      DIGIT THREE
be     00BE ! 1208+1/2  VULGAR FRACTION THREE QUARTERS [expansion]
34     0034   1209      DIGIT FOUR
35     0035   120A      DIGIT FIVE
36     0036   120B      DIGIT SIX
37     0037   120C      DIGIT SEVEN
38     0038   120D      DIGIT EIGHT
39     0039   120E      DIGIT NINE
aa     00aa   120F      FEMININE ORDINAL INDICATOR
41     0041   120F      LATIN CAPITAL LETTER A
61     0061   120F      LATIN SMALL LETTER A
c0     00c0   120F      LATIN CAPITAL LETTER A WITH GRAVE
c1     00c1   120F      LATIN CAPITAL LETTER A WITH ACUTE
c2     00c2   120F      LATIN CAPITAL LETTER A WITH CIRCUMFLEX
c3     00c3   120F      LATIN CAPITAL LETTER A WITH TILDE
e0     00e0   120F      LATIN SMALL LETTER A WITH GRAVE
e1     00e1   120F      LATIN SMALL LETTER A WITH ACUTE
e2     00e2   120F      LATIN SMALL LETTER A WITH CIRCUMFLEX
e3     00e3   120F      LATIN SMALL LETTER A WITH TILDE
42     0042   1225      LATIN CAPITAL LETTER B
62     0062   1225      LATIN SMALL LETTER B
43     0043   123D      LATIN CAPITAL LETTER C
63     0063   123D      LATIN SMALL LETTER C
c7     00c7   123D      LATIN CAPITAL LETTER C WITH CEDILLA
e7     00e7   123D      LATIN SMALL LETTER C WITH CEDILLA
44     0044   1250      LATIN CAPITAL LETTER D
64     0064   1250      LATIN SMALL LETTER D
d0     00d0   1250      LATIN CAPITAL LETTER ETH
f0     00f0   1250      LATIN SMALL LETTER ETH
45     0045   126B      LATIN CAPITAL LETTER E
65     0065   126B      LATIN SMALL LETTER E
c8     00c8   126B      LATIN CAPITAL LETTER E WITH GRAVE
c9     00c9   126B      LATIN CAPITAL LETTER E WITH ACUTE
ca     00ca   126B      LATIN CAPITAL LETTER E WITH CIRCUMFLEX
cb     00cb   126B      LATIN CAPITAL LETTER E WITH DIAERESIS
e8     00e8   126B      LATIN SMALL LETTER E WITH GRAVE
e9     00e9   126B      LATIN SMALL LETTER E WITH ACUTE
ea     00ea   126B      LATIN SMALL LETTER E WITH CIRCUMFLEX
eb     00eb   126B      LATIN SMALL LETTER E WITH DIAERESIS
46     0046   12A3      LATIN CAPITAL LETTER F
66     0066   12A3      LATIN SMALL LETTER F
83     0192   12AA      LATIN SMALL LETTER F WITH HOOK
47     0047   12B0      LATIN CAPITAL LETTER G
67     0067   12B0      LATIN SMALL LETTER G
48     0048   12D3      LATIN CAPITAL LETTER H
68     0068   12D3      LATIN SMALL LETTER H
49     0049   12EC      LATIN CAPITAL LETTER I
69     0069   12EC      LATIN SMALL LETTER I
cc     00cc   12EC      LATIN CAPITAL LETTER I WITH GRAVE
cd     00cd   12EC      LATIN CAPITAL LETTER I WITH ACUTE
ce     00ce   12EC      LATIN CAPITAL LETTER I WITH CIRCUMFLEX
cf     00cf   12EC      LATIN CAPITAL LETTER I WITH DIAERESIS
ec     00ec   12EC      LATIN SMALL LETTER I WITH GRAVE
ed     00ed   12EC      LATIN SMALL LETTER I WITH ACUTE
ee     00ee   12EC      LATIN SMALL LETTER I WITH CIRCUMFLEX
ef     00ef   12EC      LATIN SMALL LETTER I WITH DIAERESIS
4a     004A   1305      LATIN CAPITAL LETTER J
6a     006a   1305      LATIN SMALL LETTER J
4b     004B   131E      LATIN CAPITAL LETTER K
6b     006b   131E      LATIN SMALL LETTER K
4c     004C   1330      LATIN CAPITAL LETTER L
6c     006c   1330      LATIN SMALL LETTER L
4d     004D   135F      LATIN CAPITAL LETTER M
6d     006d   135F      LATIN SMALL LETTER M
4e     004E   136D      LATIN CAPITAL LETTER N
6e     006e   136D      LATIN SMALL LETTER N
d1     00d1   136D      LATIN CAPITAL LETTER N WITH TILDE
f1     00f1   136D      LATIN SMALL LETTER N WITH TILDE
4f     004F   138E      LATIN CAPITAL LETTER O
6f     006f   138E      LATIN SMALL LETTER O
d2     00d2   138E      LATIN CAPITAL LETTER O WITH GRAVE
d3     00d3   138E      LATIN CAPITAL LETTER O WITH ACUTE
d5     00d5   138E      LATIN CAPITAL LETTER O WITH TILDE
f2     00f2   138E      LATIN SMALL LETTER O WITH GRAVE
f3     00f3   138E      LATIN SMALL LETTER O WITH ACUTE
f5     00f5   138E      LATIN SMALL LETTER O WITH TILDE
ba     00BA   138E      MASCULINE ORDINAL INDICATOR
50     0050   13B3      LATIN CAPITAL LETTER P
70     0070   13B3      LATIN SMALL LETTER P
51     0051   13C8      LATIN CAPITAL LETTER Q
71     0071   13C8      LATIN SMALL LETTER Q
52     0052   13DA      LATIN CAPITAL LETTER R
72     0072   13DA      LATIN SMALL LETTER R
53     0053   1410      LATIN CAPITAL LETTER S
73     0073   1410      LATIN SMALL LETTER S
9a     0161   1410      LATIN SMALL LETTER S WITH CARON
df     00df ! 1410      LATIN SMALL LETTER SHARP S [expansion]
8a     0160   1410      LATIN CAPITAL LETTER S WITH CARON
54     0054   1433      LATIN CAPITAL LETTER T
74     0074   1433      LATIN SMALL LETTER T
de     00de   1433      LATIN CAPITAL LETTER THORN [tailoring]
fe     00fe   1433      LATIN SMALL LETTER THORN [tailoring]
99     2122 ! 1433+1    TRADE MARK SIGN [expansion]
55     0055   1453      LATIN CAPITAL LETTER U
75     0075   1453      LATIN SMALL LETTER U
d9     00d9   1453      LATIN CAPITAL LETTER U WITH GRAVE
da     00da   1453      LATIN CAPITAL LETTER U WITH ACUTE
db     00db   1453      LATIN CAPITAL LETTER U WITH CIRCUMFLEX
f9     00f9   1453      LATIN SMALL LETTER U WITH GRAVE
fa     00fa   1453      LATIN SMALL LETTER U WITH ACUTE
fb     00fb   1453      LATIN SMALL LETTER U WITH CIRCUMFLEX
56     0056   147B      LATIN CAPITAL LETTER V
76     0076   147B      LATIN SMALL LETTER V
57     0057   148D      LATIN CAPITAL LETTER W
77     0077   148D      LATIN SMALL LETTER W
58     0058   1497      LATIN CAPITAL LETTER X
78     0078   1497      LATIN SMALL LETTER X
59     0059   149C      LATIN CAPITAL LETTER Y
79     0079   149C      LATIN SMALL LETTER Y
dd     00dd   149C      LATIN CAPITAL LETTER Y WITH ACUTE
fd     00fd   149C      LATIN SMALL LETTER Y WITH ACUTE
9f     0178   149C      LATIN CAPITAL LETTER Y WITH DIAERESIS
ff     00ff   149C      LATIN SMALL LETTER Y WITH DIAERESIS
dc     00dc   149C      LATIN CAPITAL LETTER U WITH DIAERESIS [tailoring]
fc     00fc   149C      LATIN SMALL LETTER U WITH DIAERESIS [tailoring]
5a     005A   14AD      LATIN CAPITAL LETTER Z
7a     007a   14AD      LATIN SMALL LETTER Z
8e     017d   14AD      LATIN CAPITAL LETTER Z WITH CARON
9e     017e   14AD      LATIN SMALL LETTER Z WITH CARON
c5     00c5   14AD+1    LATIN CAPITAL LETTER A WITH RING ABOVE [tailoring]
e5     00e5   14AD+1    LATIN SMALL LETTER A WITH RING ABOVE [tailoring]
c6     00c6 ! 14AD+2    LATIN CAPITAL LETTER AE [tailoring]
c4     00c4   14AD+2    LATIN CAPITAL LETTER A WITH DIAERESIS [tailoring]
e4     00e4   14AD+2    LATIN SMALL LETTER A WITH DIAERESIS [tailoring]
e6     00e6   14AD+2    LATIN SMALL LETTER AE [tailoring]
d6     00d6   14AD+3    LATIN CAPITAL LETTER O WITH DIAERESIS [tailoring]
f6     00f6   14AD+3    LATIN SMALL LETTER O WITH DIAERESIS [tailoring]
8c     0152 ! 14AD+3    LATIN CAPITAL LIGATURE OE [tailoring]
9c     0153 ! 14AD+3    LATIN SMALL LIGATURE OE [tailoring]
d8     00d8   14AD+3    LATIN CAPITAL LETTER O WITH STROKE [tailoring]
f8     00f8   14AD+3    LATIN SMALL LETTER O WITH STROKE [tailoring]
d4     00d4   14AD+3    LATIN CAPITAL LETTER O WITH CIRCUMFLEX [tailoring]
f4     00f4   14AD+3    LATIN SMALL LETTER O WITH CIRCUMFLEX [tailoring]
b5     00B5   1557      MICRO SIGN
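
The replacement of UCA weights by 1-byte weights, mentioned
before the list, could be done by ranking the distinct weights.
The Python sketch below shows one possible way; it is an
assumption, not a decided design. The function name
to_byte_weights is hypothetical, a tailored weight such as
"14AD+1" is written as the pair (0x14AD, 1), and ignorables and
expansions are left out.

```python
def to_byte_weights(weights):
    """Map each latin1 byte to a 1-byte weight that preserves the
    order (and the ties) of the wider weights given in `weights`."""
    distinct = sorted(set(weights.values()))
    if len(distinct) > 256:
        raise ValueError("too many distinct weights for one byte")
    rank = {w: i for i, w in enumerate(distinct)}
    return {byte: rank[w] for byte, w in weights.items()}


# A few rows copied from the list above: latin1 byte -> (weight, offset).
SAMPLE = {
    0x59: (0x149C, 0),  # LATIN CAPITAL LETTER Y
    0xDC: (0x149C, 0),  # LATIN CAPITAL LETTER U WITH DIAERESIS
    0x5A: (0x14AD, 0),  # LATIN CAPITAL LETTER Z
    0x7A: (0x14AD, 0),  # LATIN SMALL LETTER Z
    0xC5: (0x14AD, 1),  # LATIN CAPITAL LETTER A WITH RING ABOVE
    0xC4: (0x14AD, 2),  # LATIN CAPITAL LETTER A WITH DIAERESIS
    0xD6: (0x14AD, 3),  # LATIN CAPITAL LETTER O WITH DIAERESIS
    0xB5: (0x1557, 0),  # MICRO SIGN
}

byte_weights = to_byte_weights(SAMPLE)
for byte in sorted(byte_weights, key=byte_weights.get):
    print(f"{byte:02x} -> {byte_weights[byte]}")
```

Characters that share a UCA weight share a byte weight, and the
relative order of all other characters is unchanged.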

References
----------

dev-private thread "Re: WL#2673 Unicode Collation Algorithm new version"