This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, see Section 12.9, “Unicode Support”.
MySQL supports multiple Unicode character sets:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated; please useutf8mb4instead.utf8: A deprecated alias forutf8mb3. Useutf8mb4instead.Noteutf8is expected in a future release to become an alias forutf8mb4.ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character. Deprecated; expect support for this character set to be removed in a future version of MySQL.utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Likeucs2but with an extension for supplementary characters.utf16le: The UTF-16LE encoding for the Unicode character set. Likeutf16but little-endian rather than big-endian.utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.
          The utf8mb3 character set is deprecated and
          you should expect it to be removed in a future MySQL release.
          Please use utf8mb4 instead.
          utf8 is currently an alias for
          utf8mb3, but it is now deprecated as such,
          and utf8 is expected subsequently to become
          a reference to utf8mb4.
          utf8mb3 is also displayed in place of
          utf8 in columns of Information Schema
          tables, and in the output of SQL SHOW
          statements.
        
          To avoid ambiguity about the meaning of
          utf8, consider specifying
          utf8mb4 explicitly for character set
          references.
        utf8mb4, utf16,
        utf16le, and utf32 support
        Basic Multilingual Plane (BMP) characters and supplementary
        characters that lie outside the BMP. utf8mb3
        and ucs2 support only BMP characters.
      
        Most Unicode character sets have a general collation (indicated
        by _general in the name or by the absence of
        a language specifier), a binary collation (indicated by
        _bin in the name), and several
        language-specific collations (indicated by language specifiers).
        For example, for utf8mb4,
        utf8mb4_general_ci and
        utf8mb4_bin are its general and binary
        collations, and utf8mb4_danish_ci is one of
        its language-specific collations.
      
        Most character sets have a single binary collation.
        utf8mb4 is an exception that has two:
        utf8mb4_bin and
        utf8mb4_0900_bin. These two binary collations
        have the same sort order but are distinguished by their pad
        attribute and collating weight characteristics. See
        Collation Pad Attributes, and
        Character Collating Weights.
      
        Collation support for utf16le is limited. The
        only collations available are
        utf16le_general_ci and
        utf16le_bin. These are similar to
        utf16_general_ci and
        utf16_bin.
          MySQL implements the
          
          collations according to the Unicode Collation Algorithm (UCA)
          described at
          http://www.unicode.org/reports/tr10/. The
          collation uses the version-4.0.0 UCA weight keys:
          http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
          The
          xxx_unicode_ci
          collations have only partial support for the Unicode Collation
          Algorithm. Some characters are not supported, and combining
          marks are not fully supported. This affects languages such as
          Vietnamese, Yoruba, and Navajo. A combined character is
          considered different from the same character written with a
          single unicode character in string comparisons, and the two
          characters are considered to have a different length (for
          example, as returned by the
          xxx_unicode_ciCHAR_LENGTH() function or in
          result set metadata).
        
Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Examples:
utf8mb4_unicode_520_ciis based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt),utf8mb4_0900_ai_ciis based on UCA 9.0.0 weight keys (http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt).
          The LOWER() and
          UPPER() functions perform case
          folding according to the collation of their argument. A
          character that has uppercase and lowercase versions only in a
          Unicode version higher than 4.0.0 is converted by these
          functions only if the argument collation uses a high enough
          UCA version.
          Collations based on UCA 9.0.0 and higher are faster than
          collations based on UCA versions prior to 9.0.0. They also
          have a pad attribute of NO PAD, in contrast
          to PAD SPACE as used in collations based on
          UCA versions prior to 9.0.0. For comparison of nonbinary
          strings, NO PAD collations treat spaces at
          the end of strings like any other character (see
          Trailing Space Handling in Comparisons).
        
          To determine the pad attribute for a collation, use the
          INFORMATION_SCHEMA
          COLLATIONS table, which has a
          PAD_ATTRIBUTE column. For example:
        
mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE
       FROM INFORMATION_SCHEMA.COLLATIONS
       WHERE CHARACTER_SET_NAME = 'utf8mb4';
+----------------------------+---------------+
| COLLATION_NAME             | PAD_ATTRIBUTE |
+----------------------------+---------------+
| utf8mb4_general_ci         | PAD SPACE     |
| utf8mb4_bin                | PAD SPACE     |
| utf8mb4_unicode_ci         | PAD SPACE     |
| utf8mb4_icelandic_ci       | PAD SPACE     |
...
| utf8mb4_0900_ai_ci         | NO PAD        |
| utf8mb4_de_pb_0900_ai_ci   | NO PAD        |
| utf8mb4_is_0900_ai_ci      | NO PAD        |
...
| utf8mb4_ja_0900_as_cs      | NO PAD        |
| utf8mb4_ja_0900_as_cs_ks   | NO PAD        |
| utf8mb4_0900_as_ci         | NO PAD        |
| utf8mb4_ru_0900_ai_ci      | NO PAD        |
| utf8mb4_ru_0900_as_cs      | NO PAD        |
| utf8mb4_zh_0900_as_cs      | NO PAD        |
| utf8mb4_0900_bin           | NO PAD        |
+----------------------------+---------------+
          Comparison of nonbinary string values
          (CHAR, VARCHAR, and
          TEXT) that have a NO PAD
          collation differ from other collations with respect to
          trailing spaces. For example, 'a' and
          'a ' compare as different strings, not
          the same string. This can be seen using the binary collations
          for utf8mb4. The pad attribute for
          utf8mb4_bin is PAD
          SPACE, whereas for
          utf8mb4_0900_bin it is NO
          PAD. Consequently, operations involving
          utf8mb4_0900_bin do not add trailing
          spaces, and comparisons involving strings with trailing spaces
          may differ for the two collations:
        
mysql> CREATE TABLE t1 (c CHAR(10) COLLATE utf8mb4_bin);
Query OK, 0 rows affected (0.03 sec)
mysql> INSERT INTO t1 VALUES('a');
Query OK, 1 row affected (0.01 sec)
mysql> SELECT * FROM t1 WHERE c = 'a ';
+------+
| c    |
+------+
| a    |
+------+
1 row in set (0.00 sec)
mysql> ALTER TABLE t1 MODIFY c CHAR(10) COLLATE utf8mb4_0900_bin;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> SELECT * FROM t1 WHERE c = 'a ';
Empty set (0.00 sec)
MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules. Examples of such rules appear later in this section. For questions about particular language orderings, http://unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.
          For example, the nonlanguage-specific
          utf8mb4_0900_ai_ci and language-specific
          utf8mb4_
          Unicode collations each have these characteristics:
LOCALE_0900_ai_ci
The collation is based on UCA 9.0.0 and CLDR v30, is accent-insensitive and case-insensitive. These characteristics are indicated by
_0900,_ai, and_ciin the collation name. Exception:utf8mb4_la_0900_ai_ciis not based on CLDR because Classical Latin is not defined in CLDR.The collation works for all characters in the range [U+0, U+10FFFF].
If the collation is not language specific, it sorts all characters, including supplementary characters, in default order (described following). If the collation is language specific, it sorts characters of the language correctly according to language-specific rules, and characters not in the language in default order.
By default, the collation sorts characters having a code point listed in the DUCET table (Default Unicode Collation Element Table) according to the weight value assigned in the table. The collation sorts characters not having a code point listed in the DUCET table using their implicit weight value, which is constructed according to the UCA.
For non-language-specific collations, characters in contraction sequences are treated as separate characters. For language-specific collations, contractions might change character sorting order.
A collation name that includes a locale code or language name shown in the following table is a language-specific collation. Unicode character sets may include collations for one or more of these languages.
Table 12.3 Unicode Collation Language Specifiers
| Language | Language Specifier | 
|---|---|
| Bosnian | bs | 
            
| Bulgarian | bg | 
            
| Chinese | zh | 
            
| Classical Latin | la or roman | 
            
| Croatian | hr or croatian | 
            
| Czech | cs or czech | 
            
| Danish | da or danish | 
            
| Esperanto | eo or esperanto | 
            
| Estonian | et or estonian | 
            
| Galician | gl | 
            
| German phone book order | de_pb or german2 | 
            
| Hungarian | hu or hungarian | 
            
| Icelandic | is or icelandic | 
            
| Japanese | ja | 
            
| Latvian | lv or latvian | 
            
| Lithuanian | lt or lithuanian | 
            
| Mongolian | mn | 
            
| Norwegian / Bokmål | nb | 
            
| Norwegian / Nynorsk | nn | 
            
| Persian | persian | 
            
| Polish | pl or polish | 
            
| Romanian | ro or romanian | 
            
| Russian | ru | 
            
| Serbian | sr | 
            
| Sinhala | sinhala | 
            
| Slovak | sk or slovak | 
            
| Slovenian | sl or slovenian | 
            
| Modern Spanish | es or spanish | 
            
| Traditional Spanish | es_trad or spanish2 | 
            
| Swedish | sv or swedish | 
            
| Turkish | tr or turkish | 
            
| Vietnamese | vi or vietnamese | 
          MySQL provides the Bulgarian collations
          utf8mb4_bg_0900_ai_ci and
          utf8mb4_bg_0900_as_cs.
        
          Croatian collations are tailored for these Croatian letters:
          Č, Ć,
          Dž, Đ,
          Lj, Nj,
          Š, Ž.
        
          MySQL provides the
          utf8mb4_sr_latn_0900_ai_ci and
          utf8mb4_sr_latn_0900_as_cs collations for
          Serbian and the utf8mb4_bs_0900_ai_ci and
          utf8mb4_bs_0900_as_cs collations for
          Bosnian, when these languages are written with the Latin
          alphabet.
        
          MySQL provides collations for both major varieties of
          Norwegian: for Bokmål, you can use
          utf8mb4_nb_0900_ai_ci and
          utf8mb4_nb_0900_as_cs; for Nynorsk, MySQL
          now provides utf8mb4_nn_0900_ai_ci and
          utf8mb4_nn_0900_as_cs.
        
          For Japanese, the utf8mb4 character set
          includes utf8mb4_ja_0900_as_cs and
          utf8mb4_ja_0900_as_cs_ks collations. Both
          collations are accent-sensitive and case-sensitive.
          utf8mb4_ja_0900_as_cs_ks is also
          kana-sensitive and distinguishes Katakana characters from
          Hiragana characters, whereas
          utf8mb4_ja_0900_as_cs treats Katakana and
          Hiragana characters as equal for sorting. Applications that
          require a Japanese collation but not kana sensitivity may use
          utf8mb4_ja_0900_as_cs for better sort
          performance. utf8mb4_ja_0900_as_cs uses
          three weight levels for sorting;
          utf8mb4_ja_0900_as_cs_ks uses four.
        
          For Classical Latin collations that are accent-insensitive,
          I and J compare as
          equal, and U and V
          compare as equal. I and
          J, and U and
          V compare as equal on the base letter
          level. In other words, J is regarded as an
          accented I, and U is
          regarded as an accented V.
        
          MySQL provides collations for the Mongolian language when
          written with Cyrillic characters,
          utf8mb4_mn_cyrl_0900_ai_ci and
          utf8mb4_mn_cyrl_0900_as_cs.
        
          Spanish collations are available for modern and traditional
          Spanish. For both, ñ (n-tilde) is a
          separate letter between n and
          o. In addition, for traditional Spanish,
          ch is a separate letter between
          c and d, and
          ll is a separate letter between
          l and m.
        
          Traditional Spanish collations may also be used for Asturian
          and Galician. MySQL also provides
          utf8mb4_gl_0900_ai_ci and
          utf8mb4_gl_0900_as_cs collations for
          Galician. (These are the same collations as
          utf8mb4_es_0900_ai_ci and
          utf8mb4_es_0900_as_cs, respectively.)
        
Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
Ü = Y < Ö
          For any Unicode character set, operations performed using the
          
          collation are faster than those for the
          xxx_general_ci
          collation. For example, comparisons for the
          xxx_unicode_ciutf8mb4_general_ci collation are faster,
          but slightly less correct, than comparisons for
          utf8mb4_unicode_ci. The reason is that
          utf8mb4_unicode_ci supports mappings such
          as expansions; that is, when one character compares as equal
          to combinations of other characters. For example,
          ß is equal to ss in
          German and some other languages.
          utf8mb4_unicode_ci also supports
          contractions and ignorable characters.
          utf8mb4_general_ci is a legacy collation
          that does not support expansions, contractions, or ignorable
          characters. It can make only one-to-one comparisons between
          characters.
        
          To further illustrate, the following equalities hold in both
          utf8mb4_general_ci and
          utf8mb4_unicode_ci (for the effect of this
          in comparisons or searches, see
          Section 12.8.6, “Examples of the Effect of Collation”):
        
Ä = A
Ö = O
Ü = U
          A difference between the collations is that this is true for
          utf8mb4_general_ci:
        
ß = s
          Whereas this is true for
          utf8mb4_unicode_ci, which supports the
          German DIN-1 ordering (also known as dictionary order):
        
ß = ss
          MySQL implements language-specific Unicode collations if the
          ordering with utf8mb4_unicode_ci does not
          work well for a language. For example,
          utf8mb4_unicode_ci works fine for German
          dictionary order and French, so there is no need to create
          special utf8mb4 collations.
        
          utf8mb4_general_ci also is satisfactory for
          both German and French, except that ß is
          equal to s, and not to
          ss. If this is acceptable for your
          application, you should use
          utf8mb4_general_ci because it is faster. If
          this is not acceptable (for example, if you require German
          dictionary order), use utf8mb4_unicode_ci
          because it is more accurate.
        
          If you require German DIN-2 (phone book) ordering, use the
          utf8mb4_german2_ci collation, which
          compares the following sets of characters equal:
        
Ä = Æ = AE
Ö = Œ = OE
Ü = UE
ß = ss
          utf8mb4_german2_ci is similar to
          latin1_german2_ci, but the latter does not
          compare Æ equal to AE
          or Œ equal to OE. There
          is no utf8mb4_german_ci corresponding to
          latin1_german_ci for German dictionary
          order because utf8mb4_general_ci suffices.
A character's collating weight is determined as follows:
For all Unicode collations except the
_bin(binary) collations, MySQL performs a table lookup to find a character's collating weight.For
_bincollations exceptutf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.For
utf8mb4_0900_bin, the weight is theutf8mb4encoding bytes. The sort order is the same as forutf8mb4_bin, but much faster.
          Collating weights can be displayed using the
          WEIGHT_STRING() function. (See
          Section 14.8, “String Functions and Operators”.) If a collation uses a
          weight lookup table, but a character is not in the table (for
          example, because it is a “new” character),
          collating weight determination becomes more complex:
For BMP characters in general collations (
), the weight is the code point.xxx_general_ciFor BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm applies:xxx_unicode_ciif (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;The result is a sequence of two collating elements,
aaaafollowed bybbbb. For example:mysql> SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)); +----------------------------------------------------------+ | HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) | +----------------------------------------------------------+ | FBC084CF | +----------------------------------------------------------+Thus,
U+04cf CYRILLIC SMALL LETTER PALOCHKA(ӏ) is, with all UCA 4.0.0 collations, greater thanU+04c0 CYRILLIC LETTER PALOCHKA(Ӏ). With UCA 5.2.0 collations, all palochkas sort together.For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA 4.0.0 collations, their collating weight is0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.An example with Deseret characters and
COUNT(DISTINCT):CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;The result is 2 because in the MySQL
collations, the replacement character has a weight ofxxx_unicode_ci0x0dc6, whereas Deseret Bee and Deseret Tee both have a weight of0xfffd. (Were theutf32_general_cicollation used instead, the result is 1 because all three characters have a weight of0xfffdin that collation.)An example with cuneiform characters and
WEIGHT_STRING():/* The four characters in the INSERT string are 00000041 # LATIN CAPITAL LETTER A 0001218F # CUNEIFORM SIGN KAB 000121A7 # CUNEIFORM SIGN KISH 00000042 # LATIN CAPITAL LETTER B */ CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0x000000410001218f000121a700000042); SELECT HEX(WEIGHT_STRING(s1)) FROM t;The result is:
0E33 FFFD FFFD 0E4A0E33and0E4Aare primary weights as in UCA 4.0.0.FFFDis the weight for KAB and also for KISH.The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_binFor supplementary characters based on UCA versions higher than 4.0.0 (for example,
), supplementary characters do not necessarily all have the same collating weight. Some have explicit weights from the UCAxxx_unicode_520_ciallkeys.txtfile. Others have weights calculated from this algorithm:aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
          There is a difference between “ordering by the
          character's code value” and “ordering by the
          character's binary representation,” a difference that
          appears only with utf16_bin, because of
          surrogates.
        
          Suppose that utf16_bin (the binary
          collation for utf16) was a binary
          comparison “byte by byte” rather than
          “character by character.” If that were so, the
          order of characters in utf16_bin would
          differ from the order in utf8mb4_bin. For
          example, the following chart shows two rare characters. The
          first character is in the range
          E000-FFFF, so it is
          greater than a surrogate but less than a supplementary. The
          second character is a supplementary.
        
Code point  Character                    utf8mb4      utf16
----------  ---------                    -------      -----
0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D
10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84
          The two characters in the chart are in order by code point
          value because 0xff9d <
          0x10384. And they are in order by
          utf8mb4 value because
          0xef < 0xf0. But they
          are not in order by utf16 value, if we use
          byte-by-byte comparison, because 0xff >
          0xd8.
        
          So MySQL's utf16_bin collation is not
          “byte by byte.” It is “by code
          point.” When MySQL sees a supplementary-character
          encoding in utf16, it converts to the
          character's code-point value, and then compares. Therefore,
          utf8mb4_bin and
          utf16_bin are the same ordering. This is
          consistent with the SQL:2008 standard requirement for a
          UCS_BASIC collation: “UCS_BASIC is a collation in which
          the ordering is determined entirely by the Unicode scalar
          values of the characters in the strings being sorted. It is
          applicable to the UCS character repertoire. Since every
          character repertoire is a subset of the UCS repertoire, the
          UCS_BASIC collation is potentially applicable to every
          character set. NOTE 11: The Unicode scalar value of a
          character is its code point treated as an unsigned
          integer.”
        
          If the character set is ucs2, comparison is
          byte-by-byte, but ucs2 strings should not
          contain surrogates, anyway.