WL#5476: Croatian collation

Affects: Server-5.6 — Status: Complete

Description
-----------

Support Croatian collation for Unicode character sets.

Dependent Tasks
---------------

WL#2673: Unicode Collation Algorithm new version

High Level Architecture
-----------------------

Add five Unicode 4.0 based collations:
utf8_croatian_ci
ucs2_croatian_ci
utf8mb4_croatian_ci
utf16_croatian_ci
utf32_croatian_ci

These collations will all be variants of utf8_unicode_ci and its
counterparts for the other character sets. That is, the base collation
is the one already used for the older Unicode 4.0 collations.

We will have tailoring for Croatian letters:
Č, Ć, Dž, Đ, Lj, Nj, Š, Ž.

The new collations are case insensitive.
There will be no support for secondary or tertiary sorts.

utf8_croatian_520_ci etc.
-------------------------

Originally there was an intent to
add five Unicode-5.2.0 based collations:
utf8_croatian_520_ci
ucs2_croatian_520_ci
utf8mb4_croatian_520_ci
utf16_croatian_520_ci
utf32_croatian_520_ci

with the same tailoring as in utf8_croatian_ci etc.,
but based on Unicode 5.2.0 instead of Unicode 4.0.

This part of the plan is cancelled.
We must be careful about adding new collations,
because InnoDB will support only a few more.

UCA and CLDR
------------

For new collations MySQL follows the Unicode Collation Algorithm (UCA)
with tailoring according to a Common Locale Data Repository (CLDR)
specification, which in this case is the file 'hr.xml', attached.
For details see section "Principles" in WL#2673.

Translated from the XML, the hr.xml CLDR specification defines
the Croatian tailoring as follows:

C WITH CARON follows C
C WITH ACUTE follows C WITH CARON
"D + Z WITH CARON" follows D (contraction)
"DZ WITH CARON" is equal to "D + Z WITH CARON"
D WITH STROKE follows "DZ WITH CARON"
"L + J" follows L (contraction)
LJ is equal to "L + J"
"N + J" follows N (contraction)
NJ is equal to "N + J"
S WITH CARON follows S
Z WITH CARON follows Z
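
The "is equal to" lines refer to the single code points that Unicode
provides for the three digraphs. As a minimal sketch (Python standard
library only, not part of MySQL), the following lists those code points
with their Unicode names and shows that each one is
compatibility-equivalent to a two-letter sequence. The names
DIGRAPH_CODE_POINTS and describe are made up for this illustration. The
tailoring in this worklog states the equality explicitly; the use of
NFKC normalization here is only a way to show which characters are meant.

```python
import unicodedata

# Single code points for the Croatian digraphs (upper, title, lower case
# forms of DZ WITH CARON, LJ and NJ): U+01C4 .. U+01CC.
DIGRAPH_CODE_POINTS = "\u01C4\u01C5\u01C6\u01C7\u01C8\u01C9\u01CA\u01CB\u01CC"

def describe(ch):
    # Return the Unicode name of ch and its NFKC form (the two-letter sequence).
    return unicodedata.name(ch), unicodedata.normalize("NFKC", ch)

for ch in DIGRAPH_CODE_POINTS:
    name, sequence = describe(ch)
    print(f"U+{ord(ch):04X} {name}: {len(sequence)} code points after NFKC")
```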

The resulting sorting order is:
A
B
C
Č
Ć
D
DŽ Ǆ
Đ
E
F
G
H
I
J
K
L
LJ Ǉ
M
N
NJ Ǌ
O
P
Q
R
S
Š
T
U
V
W
X
Y
Z
Ž 
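
The order above can be sketched as a primary-level, case-insensitive
sort key. This is an illustration only and not how MySQL implements the
collation (MySQL works with UCA weights). The names ALPHABET, WEIGHT,
DIGRAPH_CODE_POINTS and croatian_key are hypothetical. The sketch
assumes precomposed input (for example U+017E rather than "z" plus a
combining caron) and places any character outside the list after all
listed letters, which is an arbitrary choice.

```python
# Letters in the order listed above; the digraphs count as single letters.
ALPHABET = [
    "a", "b", "c", "\u010d", "\u0107",           # c, c with caron, c with acute
    "d", "d\u017e", "\u0111",                    # d, dz with caron, d with stroke
    "e", "f", "g", "h", "i", "j", "k",
    "l", "lj", "m", "n", "nj",
    "o", "p", "q", "r", "s", "\u0161",           # s, s with caron
    "t", "u", "v", "w", "x", "y", "z", "\u017e", # z, z with caron
]
WEIGHT = {letter: index for index, letter in enumerate(ALPHABET)}

# Single code points that the tailoring makes equal to the two-letter forms.
DIGRAPH_CODE_POINTS = {
    "\u01C4": "d\u017e", "\u01C5": "d\u017e", "\u01C6": "d\u017e",
    "\u01C7": "lj", "\u01C8": "lj", "\u01C9": "lj",
    "\u01CA": "nj", "\u01CB": "nj", "\u01CC": "nj",
}

def croatian_key(text):
    # Expand digraph code points, then fold case (the collation is _ci).
    folded = "".join(DIGRAPH_CODE_POINTS.get(ch, ch) for ch in text).lower()
    key = []
    i = 0
    while i < len(folded):
        pair = folded[i:i + 2]
        if pair in WEIGHT:
            # Contraction: two characters share one weight.
            key.append(WEIGHT[pair])
            i += 2
        else:
            ch = folded[i]
            # Unlisted characters sort after the alphabet (arbitrary choice).
            key.append(WEIGHT.get(ch, len(ALPHABET) + ord(ch)))
            i += 1
    return tuple(key)

words = ["njiva", "ljubav", "luka", "nada", "d\u017eep", "dubrovnik"]
print(sorted(words, key=croatian_key))
```

With this key "luka" sorts before "ljubav" and "nada" before "njiva",
whereas a plain code point comparison gives the opposite order for both
pairs.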

Tailoring
---------
&C < č <<< Č < ć <<< Ć

&D < dž = ǆ <<< dŽ <<< Dž = ǅ <<< DŽ = Ǆ
   < đ <<< Đ

&L < lj = ǉ <<< lJ <<< Lj = ǈ <<< LJ = Ǉ

&N < nj = ǌ <<< nJ <<< Nj = ǋ <<< NJ = Ǌ

&S < š <<< Š

&Z < ž <<< Ž

The same, using code notation:

&C < \u010D <<< \u010C < \u0107 <<< \u0106

&D < d\u017E = \u01C6 <<< d\u017D <<< D\u017E = \u01C5 <<< D\u017D = \u01C4
   < \u0111 <<< \u0110

&L < lj = \u01C9  <<< lJ <<< Lj = \u01C8 <<< LJ = \u01C7

&N < nj = \u01CC  <<< nJ <<< Nj = \u01CB <<< NJ = \u01CA

&S < \u0161 <<< \u0160

&Z < \u017E <<< \u017D
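
To show how such a rule reads, here is a minimal sketch that decodes the
\uXXXX escapes and splits a rule into its reset point and its relations
("<" is a primary difference, "<<<" a tertiary difference, "=" means
identical). The helper names decode_escapes, parse_rule and
primary_groups are hypothetical and handle only the operators used in
this worklog; MySQL's own tailoring parser is not shown here. Because
the new collations are case insensitive, with no secondary or tertiary
sorts, all elements between two "<" operators end up comparing equal,
which is what primary_groups collects.

```python
import re

# Operators used in the tailoring above, longest first.
OPERATOR = re.compile(r"(<<<|<|=)")

def decode_escapes(text):
    # Replace each backslash-u escape with the character it denotes.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def parse_rule(rule):
    # Return (reset point, [(operator, element), ...]) for one "&X ..." rule.
    rule = " ".join(decode_escapes(rule).split())  # also joins continuation lines
    if not rule.startswith("&"):
        raise ValueError("rule must start with a reset point, e.g. '&C'")
    parts = [part.strip() for part in OPERATOR.split(rule[1:])]
    return parts[0], list(zip(parts[1::2], parts[2::2]))

def primary_groups(relations):
    # Start a new group at every primary "<"; "<<<" and "=" stay in the group.
    groups = []
    for operator, element in relations:
        if operator == "<" or not groups:
            groups.append([element])
        else:
            groups[-1].append(element)
    return groups

reset, relations = parse_rule(r"&C < \u010D <<< \u010C < \u0107 <<< \u0106")
print(reset, primary_groups(relations))
```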


References
----------

Unicode database
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
(This is for reference only -- we won't support the latest version.)

Croatian CLDR file
http://unicode.org/cldr/trac/browser/trunk/common/collation/hr.xml

Real Croatian collations for cp1250, latin2
http://forge.mysql.com/worklog/task.php?id=3286