WL#3865: Vietnamese collation

Affects: Server-5.6   —   Status: Complete   —   Priority: Very High

Support Vietnamese collation for Unicode character sets.
Old Rules
---------

Add these five collations:
utf8_vietnamese_ci    /* utf8 character set Vietnamese */
ucs2_vietnamese_ci    /* ucs2 character set Vietnamese */
utf8mb4_vietnamese_ci /* utf8mb4 character set Vietnamese */
utf16_vietnamese_ci   /* utf16 character set Vietnamese */
utf32_vietnamese_ci   /* utf32 character set Vietnamese */

These collations will all be variants of utf8_unicode_ci
etc., that is, the base collation is what we used for old
collations with Unicode 4.0. The only differences from
utf8_unicode_ci etc. are with respect to the Vietnamese
tailoring rules stated below (A WITH BREVE FOLLOWS A etc.).

The new collations are case insensitive.

The new collations are the same as utf8_unicode_ci
etc. except for precomposed characters which are accented
versions of A D E O U as described in the later section
"Creating an ordered list".

There is no change to ideographic characters derived
from Chinese.

There are no digraphs. Apparently there were 10 digraphs
and 1 trigraph until a few years ago, but that is for old
dictionaries, according to Wikipedia:
http://en.wikipedia.org/wiki/Vietnamese_alphabet

There will be no support for secondary or tertiary sorts
but apparently that is okay for now, according to BUG#4745.
When we do add them, possible names are utf8_vietnamese_w2
etc. if we follow the conventions mentioned for WL#896.

UCA and CLDR
------------

For new collations MySQL follows the Unicode Collation Algorithm (UCA)
with tailoring according to a Common Locale Data Repository (CLDR)
specification, which in this case is the file 'vi.xml', attached.
For details see section "Principles" in WL#2673.

Translating from the XML, the vi.xml CLDR specification is saying
that Vietnamese tailoring is thus:
A WITH BREVE FOLLOWS A
A WITH CIRCUMFLEX FOLLOWS A WITH BREVE
D WITH STROKE FOLLOWS D
E WITH CIRCUMFLEX FOLLOWS E
O WITH CIRCUMFLEX FOLLOWS O
O WITH HORN FOLLOWS O WITH CIRCUMFLEX
U WITH HORN FOLLOWS U

(The first "<reset>" section of the CLDR
can be translated as
"<reset> U+0300 COMBINING GRAVE ACCENT
<sc> U+0309 COMBINING HOOK ABOVE
     U+0303 COMBINING TILDE
     U+0301 COMBINING ACUTE ACCENT
     U+03233COMBINING DOT BELOW
</sc>", which we can ignore.)

We see the same rules in "Vietnamese Alphabetical System"
http://vietunicode.sourceforge.net/charset/vietalphabet.html

Although we failed to get assurance that these
rules are official, we have seen comments
from Vietnamese users who agree with them.
Mimer agrees with them.

A contribution from Doan Minh Phuong
http://lists.mysql.com/internals/34036
does not agree. We will ignore the contribution.

Creating an ordered list
------------------------

The following statements will create a table with a
UCS2 column and populate with all the characters
that appear to be special for Vietnamese.
The groups of INSERT statements are in order.
A group is a series where the base character name
is on the left margin (e.g. "/* A */") followed by
INSERT statements. All characters within a group
are equal to each other on the primary level.

At the primary level, the list exactly reflects the
rules described in section "UCA + CLDR".

create table t (s1 varchar(1) character set ucs2);
/* A */
insert into t values (0x0061); /* LATIN SMALL LETTER A */
insert into t values (0x0041); /* LATIN CAPITAL LETTER A */
insert into t values (0x00E0); /* LATIN SMALL LETTER A WITH GRAVE */
insert into t values (0x00C0); /* LATIN CAPITAL LETTER A WITH GRAVE */
insert into t values (0x1EA3); /* LATIN SMALL LETTER A WITH HOOK ABOVE */
insert into t values (0x1EA2); /* LATIN CAPITAL LETTER A WITH HOOK ABOVE */
insert into t values (0x00E3); /* LATIN SMALL LETTER A WITH TILDE */
insert into t values (0x00C3); /* LATIN CAPITAL LETTER A WITH TILDE */
insert into t values (0x00E1); /* LATIN SMALL LETTER A WITH ACUTE */
insert into t values (0x00C1); /* LATIN CAPITAL LETTER A WITH ACUTE */
insert into t values (0x1EA1); /* LATIN SMALL LETTER A WITH DOT BELOW */
insert into t values (0x1EA0); /* LATIN CAPITAL LETTER A WITH DOT BELOW */
/* A WITH BREVE */
insert into t values (0x0103); /* LATIN SMALL LETTER A WITH BREVE */
insert into t values (0x0102); /* LATIN CAPITAL LETTER A WITH BREVE */
insert into t values (0x1EB1); /* LATIN SMALL LETTER A WITH BREVE AND GRAVE */
insert into t values (0x1EB0); /* LATIN CAPITAL LETTER A WITH BREVE AND GRAVE */
insert into t values (0x1EB3); /* LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE */
insert into t values (0x1EB2); /* LATIN CAPITAL LETTER A WITH BREVE AND HOOK
ABOVE */
insert into t values (0x1EB5); /* LATIN SMALL LETTER A WITH BREVE AND TILDE */
insert into t values (0x1EB4); /* LATIN CAPITAL LETTER A WITH BREVE AND TILDE */
insert into t values (0x1EAF); /* LATIN SMALL LETTER A WITH BREVE AND ACUTE */
insert into t values (0x1EAE); /* LATIN CAPITAL LETTER A WITH BREVE AND ACUTE */
insert into t values (0x1EB7); /* LATIN SMALL LETTER A WITH BREVE AND DOT BELOW */
insert into t values (0x1EB6); /* LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW */
/* A WITH CIRCUMFLEX */
insert into t values (0x00E2); /* LATIN SMALL LETTER A WITH CIRCUMFLEX */
insert into t values (0x00C2); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX */
insert into t values (0x1EA7); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE */
insert into t values (0x1EA6); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND
GRAVE */
insert into t values (0x1EA9); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK
ABOVE */
insert into t values (0x1EA8); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND
HOOK ABOVE */
insert into t values (0x1EAB); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE */
insert into t values (0x1EAA); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND
TILDE */
insert into t values (0x1EA5); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE */
insert into t values (0x1EA4); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND
ACUTE */
insert into t values (0x1EAD); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT
BELOW */
insert into t values (0x1EAC); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT
BELOW */
/* B */
insert into t values ('b'),('B');
/* C */
insert into t values ('c'),('C');
/* D */
insert into t values ('d'),('D');
/* D WITH STROKE */
insert into t values (0x0111); /* LATIN SMALL LETTER D WITH STROKE */
insert into t values (0x0110); /* LATIN CAPITAL LETTER D WITH STROKE */
/* E */
insert into t values (0x0065); /* LATIN SMALL LETTER E */
insert into t values (0x0045); /* LATIN CAPITAL LETTER E */
insert into t values (0x00E8); /* LATIN SMALL LETTER E WITH GRAVE */
insert into t values (0x00C8); /* LATIN CAPITAL LETTER E WITH GRAVE */
insert into t values (0x1EBB); /* LATIN SMALL LETTER E WITH HOOK ABOVE */
insert into t values (0x1EBA); /* LATIN CAPITAL LETTER E WITH HOOK ABOVE */
insert into t values (0x1EBD); /* LATIN SMALL LETTER E WITH TILDE */
insert into t values (0x1EBC); /* LATIN CAPITAL LETTER E WITH TILDE */
insert into t values (0x00E9); /* LATIN SMALL LETTER E WITH ACUTE */
insert into t values (0x00C9); /* LATIN CAPITAL LETTER E WITH ACUTE */
insert into t values (0x1EB9); /* LATIN SMALL LETTER E WITH DOT BELOW */
insert into t values (0x1EB8); /* LATIN CAPITAL LETTER E WITH DOT BELOW */
/* E WITH CIRCUMFLEX */
insert into t values (0x00EA); /* LATIN SMALL LETTER E WITH CIRCUMFLEX */
insert into t values (0x00CA); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX */
insert into t values (0x1EC1); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE */
insert into t values (0x1EC0); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND
GRAVE */
insert into t values (0x1EC3); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK
ABOVE */
insert into t values (0x1EC2); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND
HOOK ABOVE */
insert into t values (0x1EC5); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE */
insert into t values (0x1EC4); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND
TILDE */
insert into t values (0x1EBF); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE */
insert into t values (0x1EBE); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND
ACUTE */
insert into t values (0x1EC7); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT
BELOW */
insert into t values (0x1EC6); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT
BELOW */
/* F */
insert into t values ('f'),('F');
/* G */
insert into t values ('g'),('G');
/* H */
insert into t values ('h'),('H');
/* I */
insert into t values (0x0069); /* LATIN SMALL LETTER I */
insert into t values (0x0049); /* LATIN CAPITAL LETTER I */
insert into t values (0x00EC); /* LATIN SMALL LETTER I WITH GRAVE */
insert into t values (0x00CC); /* LATIN CAPITAL LETTER I WITH GRAVE */
insert into t values (0x1EC9); /* LATIN SMALL LETTER I WITH HOOK ABOVE */
insert into t values (0x1EC8); /* LATIN CAPITAL LETTER I WITH HOOK ABOVE */
insert into t values (0x0129); /* LATIN SMALL LETTER I WITH TILDE */
insert into t values (0x0128); /* LATIN CAPITAL LETTER I WITH TILDE */
insert into t values (0x00ED); /* LATIN SMALL LETTER I WITH ACUTE */
insert into t values (0x00CD); /* LATIN CAPITAL LETTER I WITH ACUTE */
insert into t values (0x1ECB); /* LATIN SMALL LETTER I WITH DOT BELOW */
insert into t values (0x1ECA); /* LATIN CAPITAL LETTER I WITH DOT BELOW */
/* J */
insert into t values ('j'),('J');
/* K */
insert into t values ('k'),('K');
/* L */
insert into t values ('l'),('L');
/* M */
insert into t values ('m'),('M');
/* N */
insert into t values ('n'),('N');
/* O */
insert into t values (0x006F); /* LATIN SMALL LETTER O */
insert into t values (0x004F); /* LATIN CAPITAL LETTER O */
insert into t values (0x00F2); /* LATIN SMALL LETTER O WITH GRAVE */
insert into t values (0x00D2); /* LATIN CAPITAL LETTER O WITH GRAVE */
insert into t values (0x1ECF); /* LATIN SMALL LETTER O WITH HOOK ABOVE */
insert into t values (0x1ECE); /* LATIN CAPITAL LETTER O WITH HOOK ABOVE */
insert into t values (0x00F5); /* LATIN SMALL LETTER O WITH TILDE */
insert into t values (0x00D5); /* LATIN CAPITAL LETTER O WITH TILDE */
insert into t values (0x00F3); /* LATIN SMALL LETTER O WITH ACUTE */
insert into t values (0x00D3); /* LATIN CAPITAL LETTER O WITH ACUTE */
insert into t values (0x1ECD); /* LATIN SMALL LETTER O WITH DOT BELOW */
insert into t values (0x1ECC); /* LATIN CAPITAL LETTER O WITH DOT BELOW */
/* O WITH CIRCUMFLEX */
insert into t values (0x00F4); /* LATIN SMALL LETTER O WITH CIRCUMFLEX */
insert into t values (0x00D4); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX */
insert into t values (0x1ED3); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE */
insert into t values (0x1ED2); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND
GRAVE */
insert into t values (0x1ED5); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK
ABOVE */
insert into t values (0x1ED4); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND
HOOK ABOVE */
insert into t values (0x1ED7); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE */
insert into t values (0x1ED6); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND
TILDE */
insert into t values (0x1ED1); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE */
insert into t values (0x1ED0); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND
ACUTE */
insert into t values (0x1ED9); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT
BELOW */
insert into t values (0x1ED8); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT
BELOW */
/* O WITH HORN */
insert into t values (0x01A1); /* LATIN SMALL LETTER O WITH HORN */
insert into t values (0x01A0); /* LATIN CAPITAL LETTER O WITH HORN */
insert into t values (0x1EDD); /* LATIN SMALL LETTER O WITH HORN AND GRAVE */
insert into t values (0x1EDC); /* LATIN CAPITAL LETTER O WITH HORN AND GRAVE */
insert into t values (0x1EDF); /* LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE */
insert into t values (0x1EDE); /* LATIN CAPITAL LETTER O WITH HORN AND HOOK ABOVE */
insert into t values (0x1EE1); /* LATIN SMALL LETTER O WITH HORN AND TILDE */
insert into t values (0x1EE0); /* LATIN CAPITAL LETTER O WITH HORN AND TILDE */
insert into t values (0x1EDB); /* LATIN SMALL LETTER O WITH HORN AND ACUTE */
insert into t values (0x1EDA); /* LATIN CAPITAL LETTER O WITH HORN AND ACUTE */
insert into t values (0x1EE3); /* LATIN SMALL LETTER O WITH HORN AND DOT BELOW */
insert into t values (0x1EE2); /* LATIN CAPITAL LETTER O WITH HORN AND DOT BELOW */
/* P */
insert into t values ('p'),('P');
/* Q */
insert into t values ('q'),('Q');
/* R */
insert into t values ('r'),('R');
/* S */
insert into t values ('s'),('S');
/* T */
insert into t values ('t'),('T');
/* U */
insert into t values (0x0075); /* LATIN SMALL LETTER U */
insert into t values (0x0055); /* LATIN CAPITAL LETTER U */
insert into t values (0x00F9); /* LATIN SMALL LETTER U WITH GRAVE */
insert into t values (0x00D9); /* LATIN CAPITAL LETTER U WITH GRAVE */
insert into t values (0x1EE7); /* LATIN SMALL LETTER U WITH HOOK ABOVE */
insert into t values (0x1EE6); /* LATIN CAPITAL LETTER U WITH HOOK ABOVE */
insert into t values (0x0169); /* LATIN SMALL LETTER U WITH TILDE */
insert into t values (0x0168); /* LATIN CAPITAL LETTER U WITH TILDE */
insert into t values (0x00FA); /* LATIN SMALL LETTER U WITH ACUTE */
insert into t values (0x00DA); /* LATIN CAPITAL LETTER U WITH ACUTE */
insert into t values (0x1EE5); /* LATIN SMALL LETTER U WITH DOT BELOW */
insert into t values (0x1EE4); /* LATIN CAPITAL LETTER U WITH DOT BELOW */
/* U WITH HORN */
insert into t values (0x01B0); /* LATIN SMALL LETTER U WITH HORN */
insert into t values (0x01AF); /* LATIN CAPITAL LETTER U WITH HORN */
insert into t values (0x1EEB); /* LATIN SMALL LETTER U WITH HORN AND GRAVE */
insert into t values (0x1EEA); /* LATIN CAPITAL LETTER U WITH HORN AND GRAVE */
insert into t values (0x1EED); /* LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE */
insert into t values (0x1EEC); /* LATIN CAPITAL LETTER U WITH HORN AND HOOK ABOVE */
insert into t values (0x1EEF); /* LATIN SMALL LETTER U WITH HORN AND TILDE */
insert into t values (0x1EEE); /* LATIN CAPITAL LETTER U WITH HORN AND TILDE */
insert into t values (0x1EE9); /* LATIN SMALL LETTER U WITH HORN AND ACUTE */
insert into t values (0x1EE8); /* LATIN CAPITAL LETTER U WITH HORN AND ACUTE */
insert into t values (0x1EF1); /* LATIN SMALL LETTER U WITH HORN AND DOT BELOW */
insert into t values (0x1EF0); /* LATIN CAPITAL LETTER U WITH HORN AND DOT BELOW */
/* V */
insert into t values ('v'),('V');
/* W */
insert into t values ('w'),('W');
/* X */
insert into t values ('x'),('X');
/* Y */
insert into t values (0x0079); /* LATIN SMALL LETTER Y */
insert into t values (0x0059); /* LATIN CAPITAL LETTER Y */
insert into t values (0x1EF3); /* LATIN SMALL LETTER Y WITH GRAVE */
insert into t values (0x1EF2); /* LATIN CAPITAL LETTER Y WITH GRAVE */
insert into t values (0x1EF7); /* LATIN SMALL LETTER Y WITH HOOK ABOVE */
insert into t values (0x1EF6); /* LATIN CAPITAL LETTER Y WITH HOOK ABOVE */
insert into t values (0x1EF9); /* LATIN SMALL LETTER Y WITH TILDE */
insert into t values (0x1EF8); /* LATIN CAPITAL LETTER Y WITH TILDE */
insert into t values (0x00FD); /* LATIN SMALL LETTER Y WITH ACUTE */
insert into t values (0x00DD); /* LATIN CAPITAL LETTER Y WITH ACUTE */
insert into t values (0x1EF5); /* LATIN SMALL LETTER Y WITH DOT BELOW */
insert into t values (0x1EF4); /* LATIN CAPITAL LETTER Y WITH DOT BELOW */
/* Z */
insert into t values ('z'),('Z');

Ordered List with Secondary Weights Treated As Primary Weights
--------------------------------------------------------------

This section is informative. It describes a collation which we
may add later. But it is not part of the requirement for WL#3865.

There could be a second collation, where the collation strength is 2.
That is, the secondary level should be taken into account not
only for sorting, but for comparison as well.

This applies only to the characters listed in the section
"Creating an ordered list". The relevant accents, in order, are
GRAVE, HOOK ABOVE, TILDE, ACUTE, DOT BELOW.
So the list is in order if one ignores the differences in case.

For example, A WITH ACUTE is greater than A WITH GRAVE
but less than A WITH DOT BELOW.

The final suffix in the collation name will be _w2a instead of _ci,

If this occurs after WL#896 "Primary, Secondary and Tertiary Sorts",
then we will have a _w2a ("Weight 2 Always") as described
in WL#896. In that case, secondary-level ordering applies to
all accents and probably can be generic (not specific for Vietnamese).

New Rules
---------

The original plan was to add these five collations:
utf8_vietnamese_520_ci    /* utf8 character set Vietnamese */
ucs2_vietnamese_520_ci    /* ucs2 character set Vietnamese */
utf8mb4_vietnamese_520_ci /* utf8mb4 character set Vietnamese */
utf16_vietnamese_520_ci   /* utf16 character set Vietnamese */
utf32_vietnamese_520_ci   /* utf32 character set Vietnamese */

The section "Creating an ordered list" would still be correct,
so all the characters in that list would still be in order.

On 2010-11-01 these collations were eliminated from the plan.
See dev-private email dated 2010-11-01 with
subject line = WL#3865 Vietnamese collation.

References
----------

BUG#4745 Add Vietnamese collation for the ucs2 and utf8 Unicode character sets

Wikipedia entry for Vietnamese alphabet
http://en.wikipedia.org/wiki/Vietnamese_alphabet

"Vietnamese Unicode FAQs"
http://vietunicode.sourceforge.net/

Unicode database
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

"Vietnamese Alphabetical System"
http://vietunicode.sourceforge.net/charset/vietalphabet.html

"Vietnamese Collation Chart"
http://vietunicode.sourceforge.net/charset/v3.htm

"Vietnamese UTF8 collation support"
http://lists.mysql.com/internals/34036
http://lists.mysql.com/internals/34062

"Vietnamese is a complex language on Windows"
http://blogs.msdn.com/michkap/archive/2005/08/27/457224.aspx
NB: the name Quan Nguyen appears here and also in BUG#4745.

dev-private email thread "Vietnamese collation [Was: MySQL contact]"
[ mysql internal archive ] /mailarchive/mail.php?folder=4&mail=33763

The code will produce results as shown in the High Level Specification.