MySQL implements several types of collations:
Simple collations for 8-bit character sets
This kind of collation is implemented using an array of 256
weights that defines a one-to-one mapping from character codes
latin1_swedish_ci is an example.
It is a case-insensitive collation, so the uppercase and
lowercase versions of a character have the same weights and they
compare as equal.
mysql> SET NAMES 'latin1' COLLATE 'latin1_swedish_ci'; Query OK, 0 rows affected (0.00 sec) mysql> SELECT 'a' = 'A'; +-----------+ | 'a' = 'A' | +-----------+ | 1 | +-----------+ 1 row in set (0.00 sec)
For implementation instructions, see Section 10.4.3, “Adding a Simple Collation to an 8-Bit Character Set”.
Complex collations for 8-bit character sets
This kind of collation is implemented using functions in a C source file that define how to order characters, as described in Section 10.3, “Adding a Character Set”.
Collations for non-Unicode multibyte character sets
For this type of collation, 8-bit (single-byte) and multibyte
characters are handled differently. For 8-bit characters,
character codes map to weights in case-insensitive fashion. (For
example, the single-byte characters
'A' both have a weight of
0x41.) For multibyte characters, there are
two types of relationship between character codes and weights:
Weights equal character codes.
sjis_japanese_ciis an example of this kind of collation. The multibyte character
'ぢ'has a character code of
0x82C0, and the weight is also
Character codes map one-to-one to weights, but a code is not necessarily equal to the weight.
gbk_chinese_ciis an example of this kind of collation. The multibyte character
'膰'has a character code of
0x81B0but a weight of
For implementation instructions, see Section 10.3, “Adding a Character Set”.
Collations for Unicode multibyte character sets
Some of these collations are based on the Unicode Collation Algorithm (UCA), others are not.
Non-UCA collations have a one-to-one mapping from character code
to weight. In MySQL, such collations are case insensitive and
utf8_general_ci is an
'á' each have
different character codes but all have a weight of
0x0041 and compare as equal.
mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci'; Query OK, 0 rows affected (0.00 sec) mysql> SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á'; +-----------+-----------+-----------+ | 'a' = 'A' | 'a' = 'À' | 'a' = 'á' | +-----------+-----------+-----------+ | 1 | 1 | 1 | +-----------+-----------+-----------+ 1 row in set (0.06 sec)
UCA-based collations in MySQL have these properties:
If a character has weights, each weight uses 2 bytes (16 bits).
A character may have zero weights (or an empty weight). In this case, the character is ignorable. Example: "U+0000 NULL" does not have a weight and is ignorable.
A character may have one weight. Example:
'a'has a weight of
A character may have many weights. This is an expansion. Example: The German letter
'ß'(SZ ligature, or SHARP S) has a weight of
Many characters may have one weight. This is a contraction. Example:
'ch'is a single letter in Czech and has a weight of
A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.
For implementation instructions, for a non-UCA collation, see Section 10.3, “Adding a Character Set”. For a UCA collation, see Section 10.4.4, “Adding a UCA Collation to a Unicode Character Set”.
There are also a few collations that do not fall into any of the previous categories.