WL#1875: Case insensitive Czech collation

Affects: Server-6.1 — Status: Assigned

Description
High Level Architecture

We have only case-sensitive Czech collations for cp1250 and latin2.
We want to add case-insensitive collations too.

See also Feature Request BUG#3444.


See also a patch contributed by Pavel Stehule that
implements a Czech case-insensitive collation for cp1250:

http://lists.mysql.com/internals/34318

Assume we proceed with WL#5170 Swedish collation,
and assume we take the same approach for Czech.
Then we'll follow the UCA DUCET as described in
WL#2673 "Unicode Collation Algorithm new version",
and tailor it according to CLDR.

The CLDR (Unicode Common Locale Data Repository)
http://unicode.org/repos/cldr/trunk/docs/web/repository_access.html
has a single Czech rule file, cs.xml, with three sets of rules:
"standard", "digits-after", and "search".

Tailoring at the primary level is:

For collation type="standard":
C BEFORE C WITH CARON
H BEFORE CH
R BEFORE R WITH CARON
S BEFORE S WITH CARON
Z BEFORE Z WITH CARON
This corresponds to the Czech Wikipedia page
"Abecední řazení":
A B C Č D E F G H Ch I J K L M N O P Q R Ř S Š T U V W X Y Z Ž
http://cs.wikipedia.org/wiki/Abecedn%C3%AD_%C5%99azen%C3%AD
MySQL's current utf8_czech_ci collation
http://www.collation-charts.org/mysql60/mysql604.utf8_czech_ci.html
already follows the "standard" rules.
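
To make the "standard" ordering concrete, here is a minimal Python sketch
of a primary-level sort key. It is not MySQL code and not the UCA
algorithm: the names CZECH_STANDARD, strip_marks and czech_standard_key
are hypothetical, and placing non-alphabet characters before letters is
an assumption made only for illustration. It shows CH treated as one
unit after H, caron letters as separate letters, and vowel accents and
case ignored.

```python
import unicodedata

# Alphabet order quoted in the text; "ch" is a contraction that sorts after "h".
CZECH_STANDARD = ["a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i", "j",
                  "k", "l", "m", "n", "o", "p", "q", "r", "ř", "s", "š", "t",
                  "u", "v", "w", "x", "y", "z", "ž"]
PRIMARY = {letter: rank for rank, letter in enumerate(CZECH_STANDARD, start=1)}


def strip_marks(ch):
    """Return ch without combining marks, e.g. 'á' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def czech_standard_key(text):
    """Hypothetical primary-level key: case and vowel accents are ignored."""
    s = unicodedata.normalize("NFC", text).lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            # Contraction: "ch" carries a single weight between "h" and "i".
            key.append((1, PRIMARY["ch"]))
            i += 2
            continue
        ch = s[i]
        # Letters without their own weight (e.g. "á") fall back to the base letter.
        base = ch if ch in PRIMARY else strip_marks(ch)
        if base in PRIMARY:
            key.append((1, PRIMARY[base]))
        else:
            # Assumption: anything outside the alphabet sorts before letters.
            key.append((0, ord(ch)))
        i += 1
    return tuple(key)


words = ["chata", "cihla", "hrad", "čaj", "ibis"]
print(sorted(words, key=czech_standard_key))
```

With this key, "čaj" sorts after "cihla" and "chata" sorts between
"hrad" and "ibis", while "ma" and "má" compare equal.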

For collation type="digits-after":
same as collation type="standard", except that
the digits 0123456789 come after letters.
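
The only difference from "standard" is where digits land. The following
Python sketch isolates that one change; digits_after_key is a
hypothetical name, and the sketch deliberately leaves out the Czech
letter order (letters are ranked by code point here), so it illustrates
the digit placement only.

```python
def digits_after_key(text):
    """Hypothetical key: other symbols first, then letters, then digits."""
    key = []
    for ch in text.lower():
        if ch.isdigit():
            group = 2   # digits come after letters
        elif ch.isalpha():
            group = 1   # letters (code point order, not Czech order)
        else:
            group = 0   # assumption: punctuation and spaces stay first
        key.append((group, ord(ch)))
    return tuple(key)


print(sorted(["a1", "1a", "ab", "9"], key=digits_after_key))
```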

For collation type="search":
(looking only at Czech-specific rules)
A BEFORE A WITH ACUTE
C BEFORE C WITH CARON
D BEFORE D WITH CARON
E BEFORE E WITH ACUTE
E WITH ACUTE BEFORE E WITH CARON
H BEFORE CH
I BEFORE I WITH ACUTE
N BEFORE N WITH CARON
O BEFORE O WITH ACUTE
R BEFORE R WITH CARON
S BEFORE S WITH CARON
T BEFORE T WITH CARON
U BEFORE U WITH ACUTE
U WITH ACUTE BEFORE U WITH RING ABOVE
Y BEFORE Y WITH ACUTE
Z BEFORE Z WITH CARON
Two bug reports (BUG#32404, BUG#61615) ask for
sensitivity to vowel accents, which only "search" delivers.
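
As an illustration of what the bug reporters want, here is a minimal
Python sketch of a key built from the "search" rule list above. The
names CZECH_SEARCH and czech_search_key are hypothetical, the letter
order is simply read off the BEFORE rules, and placing non-alphabet
characters before letters is an assumption. Case is still ignored, but
accented vowels now get their own weights.

```python
import unicodedata

# Order read off the "search" rules: each accented letter follows its base
# letter, and "ch" is a contraction that sorts after "h".
CZECH_SEARCH = ["a", "á", "b", "c", "č", "d", "ď", "e", "é", "ě", "f", "g",
                "h", "ch", "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó",
                "p", "q", "r", "ř", "s", "š", "t", "ť", "u", "ú", "ů", "v",
                "w", "x", "y", "ý", "z", "ž"]
RANK = {letter: rank for rank, letter in enumerate(CZECH_SEARCH, start=1)}


def czech_search_key(text):
    """Hypothetical key: case-insensitive, but sensitive to vowel accents."""
    s = unicodedata.normalize("NFC", text).lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            key.append((1, RANK["ch"]))
            i += 2
            continue
        ch = s[i]
        if ch in RANK:
            key.append((1, RANK[ch]))
        else:
            # Assumption: anything outside the list sorts before letters.
            key.append((0, ord(ch)))
        i += 1
    return tuple(key)


print(czech_search_key("ma") == czech_search_key("má"))
print(sorted(["mě", "me", "mé"], key=czech_search_key))
```

Under this key "ma" and "má" are no longer equal, and "me", "mé", "mě"
sort in that order.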

These rules can be applied to the Unicode character sets.
Suggested name = utf8_czech_600_ci etc.
The expectation is that 8-bit character sets matter less.

It's unclear whether Czech with standard rules is one of
the collations that we call "tricky".

References
----------

Follow the same sort of ideas as seen in
WL#5170 Swedish collation.

BUG#3444 Case sensitivity in czech comparisons
BUG#8644 make the cp1250_czech_cs act like latin2_czech_cs
BUG#32404 Cannot obtain accent sensitive czech collation
BUG#34371 Czech collation ("not a bug")
BUG#61615 Mysql evaluates chars with a comma and chars without a comma as the same ("duplicate")

Email thread "utf8_czech_ci"
[ mysql intranet archive ]/secure/mailarchive/mail.php?folder=4&mail=13244