WL#3865: Vietnamese collation
Affects: Server-5.6
—
Status: Complete
Support Vietnamese collation for Unicode character sets.
Old Rules --------- Add these five collations: utf8_vietnamese_ci /* utf8 character set Vietnamese */ ucs2_vietnamese_ci /* ucs2 character set Vietnamese */ utf8mb4_vietnamese_ci /* utf8mb4 character set Vietnamese */ utf16_vietnamese_ci /* utf16 character set Vietnamese */ utf32_vietnamese_ci /* utf32 character set Vietnamese */ These collations will all be variants of utf8_unicode_ci etc., that is, the base collation is what we used for old collations with Unicode 4.0. The only differences from utf8_unicode_ci etc. are with respect to the Vietnamese tailoring rules stated below (A WITH BREVE FOLLOWS A etc.). The new collations are case insensitive. The new collations are the same as utf8_unicode_ci etc. except for precomposed characters which are accented versions of A D E O U as described in the later section "Creating an ordered list". There is no change to ideographic characters derived from Chinese. There are no digraphs. Apparently there were 10 digraphs and 1 trigraph until a few years ago, but that is for old dictionaries, according to Wikipedia: http://en.wikipedia.org/wiki/Vietnamese_alphabet There will be no support for secondary or tertiary sorts but apparently that is okay for now, according to BUG#4745. When we do add them, possible names are utf8_vietnamese_w2 etc. if we follow the conventions mentioned for WL#896. UCA and CLDR ------------ For new collations MySQL follows the Unicode Collation Algorithm (UCA) with tailoring according to a Common Locale Data Repository (CLDR) specification, which in this case is the file 'vi.xml', attached. For details see section "Principles" in WL#2673. Translating from the XML, the vi.xml CLDR specification is saying that Vietnamese tailoring is thus: A WITH BREVE FOLLOWS A A WITH CIRCUMFLEX FOLLOWS A WITH BREVE D WITH STROKE FOLLOWS D E WITH CIRCUMFLEX FOLLOWS E O WITH CIRCUMFLEX FOLLOWS O O WITH HORN FOLLOWS O WITH CIRCUMFLEX U WITH HORN FOLLOWS U (The first "" section of the CLDR can be translated as " U+0300 COMBINING GRAVE ACCENT U+0309 COMBINING HOOK ABOVE U+0303 COMBINING TILDE U+0301 COMBINING ACUTE ACCENT U+03233COMBINING DOT BELOW ", which we can ignore.) We see the same rules in "Vietnamese Alphabetical System" http://vietunicode.sourceforge.net/charset/vietalphabet.html Although we failed to get assurance that these rules are official, we have seen comments from Vietnamese users who agree with them. Mimer agrees with them. A contribution from Doan Minh Phuong http://lists.mysql.com/internals/34036 does not agree. We will ignore the contribution. Creating an ordered list ------------------------ The following statements will create a table with a UCS2 column and populate with all the characters that appear to be special for Vietnamese. The groups of INSERT statements are in order. A group is a series where the base character name is on the left margin (e.g. "/* A */") followed by INSERT statements. All characters within a group are equal to each other on the primary level. At the primary level, the list exactly reflects the rules described in section "UCA + CLDR". create table t (s1 varchar(1) character set ucs2); /* A */ insert into t values (0x0061); /* LATIN SMALL LETTER A */ insert into t values (0x0041); /* LATIN CAPITAL LETTER A */ insert into t values (0x00E0); /* LATIN SMALL LETTER A WITH GRAVE */ insert into t values (0x00C0); /* LATIN CAPITAL LETTER A WITH GRAVE */ insert into t values (0x1EA3); /* LATIN SMALL LETTER A WITH HOOK ABOVE */ insert into t values (0x1EA2); /* LATIN CAPITAL LETTER A WITH HOOK ABOVE */ insert into t values (0x00E3); /* LATIN SMALL LETTER A WITH TILDE */ insert into t values (0x00C3); /* LATIN CAPITAL LETTER A WITH TILDE */ insert into t values (0x00E1); /* LATIN SMALL LETTER A WITH ACUTE */ insert into t values (0x00C1); /* LATIN CAPITAL LETTER A WITH ACUTE */ insert into t values (0x1EA1); /* LATIN SMALL LETTER A WITH DOT BELOW */ insert into t values (0x1EA0); /* LATIN CAPITAL LETTER A WITH DOT BELOW */ /* A WITH BREVE */ insert into t values (0x0103); /* LATIN SMALL LETTER A WITH BREVE */ insert into t values (0x0102); /* LATIN CAPITAL LETTER A WITH BREVE */ insert into t values (0x1EB1); /* LATIN SMALL LETTER A WITH BREVE AND GRAVE */ insert into t values (0x1EB0); /* LATIN CAPITAL LETTER A WITH BREVE AND GRAVE */ insert into t values (0x1EB3); /* LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE */ insert into t values (0x1EB2); /* LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE */ insert into t values (0x1EB5); /* LATIN SMALL LETTER A WITH BREVE AND TILDE */ insert into t values (0x1EB4); /* LATIN CAPITAL LETTER A WITH BREVE AND TILDE */ insert into t values (0x1EAF); /* LATIN SMALL LETTER A WITH BREVE AND ACUTE */ insert into t values (0x1EAE); /* LATIN CAPITAL LETTER A WITH BREVE AND ACUTE */ insert into t values (0x1EB7); /* LATIN SMALL LETTER A WITH BREVE AND DOT BELOW */ insert into t values (0x1EB6); /* LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW */ /* A WITH CIRCUMFLEX */ insert into t values (0x00E2); /* LATIN SMALL LETTER A WITH CIRCUMFLEX */ insert into t values (0x00C2); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX */ insert into t values (0x1EA7); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE */ insert into t values (0x1EA6); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE */ insert into t values (0x1EA9); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE */ insert into t values (0x1EA8); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE */ insert into t values (0x1EAB); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE */ insert into t values (0x1EAA); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE */ insert into t values (0x1EA5); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE */ insert into t values (0x1EA4); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE */ insert into t values (0x1EAD); /* LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW */ insert into t values (0x1EAC); /* LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW */ /* B */ insert into t values ('b'),('B'); /* C */ insert into t values ('c'),('C'); /* D */ insert into t values ('d'),('D'); /* D WITH STROKE */ insert into t values (0x0111); /* LATIN SMALL LETTER D WITH STROKE */ insert into t values (0x0110); /* LATIN CAPITAL LETTER D WITH STROKE */ /* E */ insert into t values (0x0065); /* LATIN SMALL LETTER E */ insert into t values (0x0045); /* LATIN CAPITAL LETTER E */ insert into t values (0x00E8); /* LATIN SMALL LETTER E WITH GRAVE */ insert into t values (0x00C8); /* LATIN CAPITAL LETTER E WITH GRAVE */ insert into t values (0x1EBB); /* LATIN SMALL LETTER E WITH HOOK ABOVE */ insert into t values (0x1EBA); /* LATIN CAPITAL LETTER E WITH HOOK ABOVE */ insert into t values (0x1EBD); /* LATIN SMALL LETTER E WITH TILDE */ insert into t values (0x1EBC); /* LATIN CAPITAL LETTER E WITH TILDE */ insert into t values (0x00E9); /* LATIN SMALL LETTER E WITH ACUTE */ insert into t values (0x00C9); /* LATIN CAPITAL LETTER E WITH ACUTE */ insert into t values (0x1EB9); /* LATIN SMALL LETTER E WITH DOT BELOW */ insert into t values (0x1EB8); /* LATIN CAPITAL LETTER E WITH DOT BELOW */ /* E WITH CIRCUMFLEX */ insert into t values (0x00EA); /* LATIN SMALL LETTER E WITH CIRCUMFLEX */ insert into t values (0x00CA); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX */ insert into t values (0x1EC1); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND GRAVE */ insert into t values (0x1EC0); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND GRAVE */ insert into t values (0x1EC3); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE */ insert into t values (0x1EC2); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND HOOK ABOVE */ insert into t values (0x1EC5); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE */ insert into t values (0x1EC4); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND TILDE */ insert into t values (0x1EBF); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE */ insert into t values (0x1EBE); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND ACUTE */ insert into t values (0x1EC7); /* LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW */ insert into t values (0x1EC6); /* LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT BELOW */ /* F */ insert into t values ('f'),('F'); /* G */ insert into t values ('g'),('G'); /* H */ insert into t values ('h'),('H'); /* I */ insert into t values (0x0069); /* LATIN SMALL LETTER I */ insert into t values (0x0049); /* LATIN CAPITAL LETTER I */ insert into t values (0x00EC); /* LATIN SMALL LETTER I WITH GRAVE */ insert into t values (0x00CC); /* LATIN CAPITAL LETTER I WITH GRAVE */ insert into t values (0x1EC9); /* LATIN SMALL LETTER I WITH HOOK ABOVE */ insert into t values (0x1EC8); /* LATIN CAPITAL LETTER I WITH HOOK ABOVE */ insert into t values (0x0129); /* LATIN SMALL LETTER I WITH TILDE */ insert into t values (0x0128); /* LATIN CAPITAL LETTER I WITH TILDE */ insert into t values (0x00ED); /* LATIN SMALL LETTER I WITH ACUTE */ insert into t values (0x00CD); /* LATIN CAPITAL LETTER I WITH ACUTE */ insert into t values (0x1ECB); /* LATIN SMALL LETTER I WITH DOT BELOW */ insert into t values (0x1ECA); /* LATIN CAPITAL LETTER I WITH DOT BELOW */ /* J */ insert into t values ('j'),('J'); /* K */ insert into t values ('k'),('K'); /* L */ insert into t values ('l'),('L'); /* M */ insert into t values ('m'),('M'); /* N */ insert into t values ('n'),('N'); /* O */ insert into t values (0x006F); /* LATIN SMALL LETTER O */ insert into t values (0x004F); /* LATIN CAPITAL LETTER O */ insert into t values (0x00F2); /* LATIN SMALL LETTER O WITH GRAVE */ insert into t values (0x00D2); /* LATIN CAPITAL LETTER O WITH GRAVE */ insert into t values (0x1ECF); /* LATIN SMALL LETTER O WITH HOOK ABOVE */ insert into t values (0x1ECE); /* LATIN CAPITAL LETTER O WITH HOOK ABOVE */ insert into t values (0x00F5); /* LATIN SMALL LETTER O WITH TILDE */ insert into t values (0x00D5); /* LATIN CAPITAL LETTER O WITH TILDE */ insert into t values (0x00F3); /* LATIN SMALL LETTER O WITH ACUTE */ insert into t values (0x00D3); /* LATIN CAPITAL LETTER O WITH ACUTE */ insert into t values (0x1ECD); /* LATIN SMALL LETTER O WITH DOT BELOW */ insert into t values (0x1ECC); /* LATIN CAPITAL LETTER O WITH DOT BELOW */ /* O WITH CIRCUMFLEX */ insert into t values (0x00F4); /* LATIN SMALL LETTER O WITH CIRCUMFLEX */ insert into t values (0x00D4); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX */ insert into t values (0x1ED3); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE */ insert into t values (0x1ED2); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE */ insert into t values (0x1ED5); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE */ insert into t values (0x1ED4); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE */ insert into t values (0x1ED7); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE */ insert into t values (0x1ED6); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND TILDE */ insert into t values (0x1ED1); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND ACUTE */ insert into t values (0x1ED0); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND ACUTE */ insert into t values (0x1ED9); /* LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW */ insert into t values (0x1ED8); /* LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT BELOW */ /* O WITH HORN */ insert into t values (0x01A1); /* LATIN SMALL LETTER O WITH HORN */ insert into t values (0x01A0); /* LATIN CAPITAL LETTER O WITH HORN */ insert into t values (0x1EDD); /* LATIN SMALL LETTER O WITH HORN AND GRAVE */ insert into t values (0x1EDC); /* LATIN CAPITAL LETTER O WITH HORN AND GRAVE */ insert into t values (0x1EDF); /* LATIN SMALL LETTER O WITH HORN AND HOOK ABOVE */ insert into t values (0x1EDE); /* LATIN CAPITAL LETTER O WITH HORN AND HOOK ABOVE */ insert into t values (0x1EE1); /* LATIN SMALL LETTER O WITH HORN AND TILDE */ insert into t values (0x1EE0); /* LATIN CAPITAL LETTER O WITH HORN AND TILDE */ insert into t values (0x1EDB); /* LATIN SMALL LETTER O WITH HORN AND ACUTE */ insert into t values (0x1EDA); /* LATIN CAPITAL LETTER O WITH HORN AND ACUTE */ insert into t values (0x1EE3); /* LATIN SMALL LETTER O WITH HORN AND DOT BELOW */ insert into t values (0x1EE2); /* LATIN CAPITAL LETTER O WITH HORN AND DOT BELOW */ /* P */ insert into t values ('p'),('P'); /* Q */ insert into t values ('q'),('Q'); /* R */ insert into t values ('r'),('R'); /* S */ insert into t values ('s'),('S'); /* T */ insert into t values ('t'),('T'); /* U */ insert into t values (0x0075); /* LATIN SMALL LETTER U */ insert into t values (0x0055); /* LATIN CAPITAL LETTER U */ insert into t values (0x00F9); /* LATIN SMALL LETTER U WITH GRAVE */ insert into t values (0x00D9); /* LATIN CAPITAL LETTER U WITH GRAVE */ insert into t values (0x1EE7); /* LATIN SMALL LETTER U WITH HOOK ABOVE */ insert into t values (0x1EE6); /* LATIN CAPITAL LETTER U WITH HOOK ABOVE */ insert into t values (0x0169); /* LATIN SMALL LETTER U WITH TILDE */ insert into t values (0x0168); /* LATIN CAPITAL LETTER U WITH TILDE */ insert into t values (0x00FA); /* LATIN SMALL LETTER U WITH ACUTE */ insert into t values (0x00DA); /* LATIN CAPITAL LETTER U WITH ACUTE */ insert into t values (0x1EE5); /* LATIN SMALL LETTER U WITH DOT BELOW */ insert into t values (0x1EE4); /* LATIN CAPITAL LETTER U WITH DOT BELOW */ /* U WITH HORN */ insert into t values (0x01B0); /* LATIN SMALL LETTER U WITH HORN */ insert into t values (0x01AF); /* LATIN CAPITAL LETTER U WITH HORN */ insert into t values (0x1EEB); /* LATIN SMALL LETTER U WITH HORN AND GRAVE */ insert into t values (0x1EEA); /* LATIN CAPITAL LETTER U WITH HORN AND GRAVE */ insert into t values (0x1EED); /* LATIN SMALL LETTER U WITH HORN AND HOOK ABOVE */ insert into t values (0x1EEC); /* LATIN CAPITAL LETTER U WITH HORN AND HOOK ABOVE */ insert into t values (0x1EEF); /* LATIN SMALL LETTER U WITH HORN AND TILDE */ insert into t values (0x1EEE); /* LATIN CAPITAL LETTER U WITH HORN AND TILDE */ insert into t values (0x1EE9); /* LATIN SMALL LETTER U WITH HORN AND ACUTE */ insert into t values (0x1EE8); /* LATIN CAPITAL LETTER U WITH HORN AND ACUTE */ insert into t values (0x1EF1); /* LATIN SMALL LETTER U WITH HORN AND DOT BELOW */ insert into t values (0x1EF0); /* LATIN CAPITAL LETTER U WITH HORN AND DOT BELOW */ /* V */ insert into t values ('v'),('V'); /* W */ insert into t values ('w'),('W'); /* X */ insert into t values ('x'),('X'); /* Y */ insert into t values (0x0079); /* LATIN SMALL LETTER Y */ insert into t values (0x0059); /* LATIN CAPITAL LETTER Y */ insert into t values (0x1EF3); /* LATIN SMALL LETTER Y WITH GRAVE */ insert into t values (0x1EF2); /* LATIN CAPITAL LETTER Y WITH GRAVE */ insert into t values (0x1EF7); /* LATIN SMALL LETTER Y WITH HOOK ABOVE */ insert into t values (0x1EF6); /* LATIN CAPITAL LETTER Y WITH HOOK ABOVE */ insert into t values (0x1EF9); /* LATIN SMALL LETTER Y WITH TILDE */ insert into t values (0x1EF8); /* LATIN CAPITAL LETTER Y WITH TILDE */ insert into t values (0x00FD); /* LATIN SMALL LETTER Y WITH ACUTE */ insert into t values (0x00DD); /* LATIN CAPITAL LETTER Y WITH ACUTE */ insert into t values (0x1EF5); /* LATIN SMALL LETTER Y WITH DOT BELOW */ insert into t values (0x1EF4); /* LATIN CAPITAL LETTER Y WITH DOT BELOW */ /* Z */ insert into t values ('z'),('Z'); Ordered List with Secondary Weights Treated As Primary Weights -------------------------------------------------------------- This section is informative. It describes a collation which we may add later. But it is not part of the requirement for WL#3865. There could be a second collation, where the collation strength is 2. That is, the secondary level should be taken into account not only for sorting, but for comparison as well. This applies only to the characters listed in the section "Creating an ordered list". The relevant accents, in order, are GRAVE, HOOK ABOVE, TILDE, ACUTE, DOT BELOW. So the list is in order if one ignores the differences in case. For example, A WITH ACUTE is greater than A WITH GRAVE but less than A WITH DOT BELOW. The final suffix in the collation name will be _w2a instead of _ci, If this occurs after WL#896 "Primary, Secondary and Tertiary Sorts", then we will have a _w2a ("Weight 2 Always") as described in WL#896. In that case, secondary-level ordering applies to all accents and probably can be generic (not specific for Vietnamese). New Rules --------- The original plan was to add these five collations: utf8_vietnamese_520_ci /* utf8 character set Vietnamese */ ucs2_vietnamese_520_ci /* ucs2 character set Vietnamese */ utf8mb4_vietnamese_520_ci /* utf8mb4 character set Vietnamese */ utf16_vietnamese_520_ci /* utf16 character set Vietnamese */ utf32_vietnamese_520_ci /* utf32 character set Vietnamese */ The section "Creating an ordered list" would still be correct, so all the characters in that list would still be in order. On 2010-11-01 these collations were eliminated from the plan. See dev-private email dated 2010-11-01 with subject line = WL#3865 Vietnamese collation. References ---------- BUG#4745 Add Vietnamese collation for the ucs2 and utf8 Unicode character sets Wikipedia entry for Vietnamese alphabet http://en.wikipedia.org/wiki/Vietnamese_alphabet "Vietnamese Unicode FAQs" http://vietunicode.sourceforge.net/ Unicode database http://www.unicode.org/Public/UNIDATA/UnicodeData.txt "Vietnamese Alphabetical System" http://vietunicode.sourceforge.net/charset/vietalphabet.html "Vietnamese Collation Chart" http://vietunicode.sourceforge.net/charset/v3.htm "Vietnamese UTF8 collation support" http://lists.mysql.com/internals/34036 http://lists.mysql.com/internals/34062 "Vietnamese is a complex language on Windows" http://blogs.msdn.com/michkap/archive/2005/08/27/457224.aspx NB: the name Quan Nguyen appears here and also in BUG#4745. dev-private email thread "Vietnamese collation [Was: MySQL contact]" [ mysql internal archive ] /mailarchive/mail.php?folder=4&mail=33763
The code will produce results as shown in the High Level Specification.
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.