MySQL Internals Manual  /  Important Algorithms and Structures  /  Character Sets and Collations

10.11 Character Sets and Collations

Character sets are used by MySQL when storing information, both to ensure that the information is stored (and returned) in the correct format, but also for the purposes of collation and sorting. Each character set supports one or more collations, and so these are collectively known as Collation Sets, rather than character sets.

Character sets are recorded against individual tables and returned as part of the field data. For example, the MYSQL_FIELD data type definition includes the field charsetnr:

typedef struct st_mysql_field {
  char *name;                 /* Name of column */
  char *org_name;             /* Original column name, if an alias */
  char *table;                /* Table of column if column was a field */
  char *org_table;            /* Org table name, if table was an alias */
  char *db;                   /* Database for table */
  char *catalog;              /* Catalog for table */
  char *def;                  /* Default value (set by mysql_list_fields) */
  unsigned long length;       /* Width of column (create length) */
  unsigned long max_length;   /* Max width for selected set */
  unsigned int name_length;
  unsigned int org_name_length;
  unsigned int table_length;
  unsigned int org_table_length;
  unsigned int db_length;
  unsigned int catalog_length;
  unsigned int def_length;
  unsigned int flags;         /* Div flags */
  unsigned int decimals;      /* Number of decimals in field */
  unsigned int charsetnr;     /* Character set */
  enum enum_field_types type; /* Type of field. See mysql_com.h for types */
} MYSQL_FIELD;

Character set and collation information are specific to a server version and installation, and are generated automatically from the sql/share/charsets/Index.xml file in the source distribution.

You can obtain a list of the available character sets configured within a server by running SHOW COLLATION, or by running a query on the INFORMATION_SCHEMA.COLLATION table. A sample of the information from that table has been provided here for reference.

Collation Id

Charset

Collation

Default

Sortlen

64

armscii8

armscii8_bin

??

1

32

armscii8

armscii8_general_ci

Yes

1

65

ascii

ascii_bin

??

1

11

ascii

ascii_general_ci

Yes

1

84

big5

big5_bin

??

1

1

big5

big5_chinese_ci

Yes

1

63

binary

binary

Yes

1

66

cp1250

cp1250_bin

??

1

44

cp1250

cp1250_croatian_ci

??

1

34

cp1250

cp1250_czech_cs

??

2

26

cp1250

cp1250_general_ci

Yes

1

50

cp1251

cp1251_bin

??

1

14

cp1251

cp1251_bulgarian_ci

??

1

52

cp1251

cp1251_general_cs

??

1

23

cp1251

cp1251_ukrainian_ci

??

1

51

cp1251

cp1251_general_ci

Yes

1

67

cp1256

cp1256_bin

??

1

57

cp1256

cp1256_general_ci

Yes

1

58

cp1257

cp1257_bin

??

1

29

cp1257

cp1257_lithuanian_ci

??

1

59

cp1257

cp1257_general_ci

Yes

1

80

cp850

cp850_bin

??

1

4

cp850

cp850_general_ci

Yes

1

81

cp852

cp852_bin

??

1

40

cp852

cp852_general_ci

Yes

1

68

cp866

cp866_bin

??

1

36

cp866

cp866_general_ci

Yes

1

96

cp932

cp932_bin

??

1

95

cp932

cp932_japanese_ci

Yes

1

69

dec8

dec8_bin

??

1

3

dec8

dec8_swedish_ci

Yes

1

98

eucjpms

eucjpms_bin

??

1

97

eucjpms

eucjpms_japanese_ci

Yes

1

85

euckr

euckr_bin

??

1

19

euckr

euckr_korean_ci

Yes

1

86

gb2312

gb2312_bin

??

1

24

gb2312

gb2312_chinese_ci

Yes

1

87

gbk

gbk_bin

??

1

28

gbk

gbk_chinese_ci

Yes

1

93

geostd8

geostd8_bin

??

1

92

geostd8

geostd8_general_ci

Yes

1

70

greek

greek_bin

??

1

25

greek

greek_general_ci

Yes

1

71

hebrew

hebrew_bin

??

1

16

hebrew

hebrew_general_ci

Yes

1

72

hp8

hp8_bin

??

1

6

hp8

hp8_english_ci

Yes

1

73

keybcs2

keybcs2_bin

??

1

37

keybcs2

keybcs2_general_ci

Yes

1

74

koi8r

koi8r_bin

??

1

7

koi8r

koi8r_general_ci

Yes

1

75

koi8u

koi8u_bin

??

1

22

koi8u

koi8u_general_ci

Yes

1

47

latin1

latin1_bin

??

1

15

latin1

latin1_danish_ci

??

1

48

latin1

latin1_general_ci

??

1

49

latin1

latin1_general_cs

??

1

5

latin1

latin1_german1_ci

??

1

31

latin1

latin1_german2_ci

??

2

94

latin1

latin1_spanish_ci

??

1

8

latin1

latin1_swedish_ci

Yes

1

77

latin2

latin2_bin

??

1

27

latin2

latin2_croatian_ci

??

1

2

latin2

latin2_czech_cs

??

4

21

latin2

latin2_hungarian_ci

??

1

9

latin2

latin2_general_ci

Yes

1

78

latin5

latin5_bin

??

1

30

latin5

latin5_turkish_ci

Yes

1

79

latin7

latin7_bin

??

1

20

latin7

latin7_estonian_cs

??

1

42

latin7

latin7_general_cs

??

1

41

latin7

latin7_general_ci

Yes

1

43

macce

macce_bin

??

1

38

macce

macce_general_ci

Yes

1

53

macroman

macroman_bin

??

1

39

macroman

macroman_general_ci

Yes

1

88

sjis

sjis_bin

??

1

13

sjis

sjis_japanese_ci

Yes

1

82

swe7

swe7_bin

??

1

10

swe7

swe7_swedish_ci

Yes

1

89

tis620

tis620_bin

??

1

18

tis620

tis620_thai_ci

Yes

4

90

ucs2

ucs2_bin

??

1

138

ucs2

ucs2_czech_ci

??

8

139

ucs2

ucs2_danish_ci

??

8

145

ucs2

ucs2_esperanto_ci

??

8

134

ucs2

ucs2_estonian_ci

??

8

146

ucs2

ucs2_hungarian_ci

??

8

129

ucs2

ucs2_icelandic_ci

??

8

130

ucs2

ucs2_latvian_ci

??

8

140

ucs2

ucs2_lithuanian_ci

??

8

144

ucs2

ucs2_persian_ci

??

8

133

ucs2

ucs2_polish_ci

??

8

131

ucs2

ucs2_romanian_ci

??

8

143

ucs2

ucs2_roman_ci

??

8

141

ucs2

ucs2_slovak_ci

??

8

132

ucs2

ucs2_slovenian_ci

??

8

142

ucs2

ucs2_spanish2_ci

??

8

135

ucs2

ucs2_spanish_ci

??

8

136

ucs2

ucs2_swedish_ci

??

8

137

ucs2

ucs2_turkish_ci

??

8

128

ucs2

ucs2_unicode_ci

??

8

35

ucs2

ucs2_general_ci

Yes

1

91

ujis

ujis_bin

??

1

12

ujis

ujis_japanese_ci

Yes

1

83

utf8

utf8_bin

??

1

202

utf8

utf8_czech_ci

??

8

203

utf8

utf8_danish_ci

??

8

209

utf8

utf8_esperanto_ci

??

8

198

utf8

utf8_estonian_ci

??

8

210

utf8

utf8_hungarian_ci

??

8

193

utf8

utf8_icelandic_ci

??

8

194

utf8

utf8_latvian_ci

??

8

204

utf8

utf8_lithuanian_ci

??

8

208

utf8

utf8_persian_ci

??

8

197

utf8

utf8_polish_ci

??

8

195

utf8

utf8_romanian_ci

??

8

207

utf8

utf8_roman_ci

??

8

205

utf8

utf8_slovak_ci

??

8

196

utf8

utf8_slovenian_ci

??

8

206

utf8

utf8_spanish2_ci

??

8

199

utf8

utf8_spanish_ci

??

8

200

utf8

utf8_swedish_ci

??

8

201

utf8

utf8_turkish_ci

??

8

192

utf8

utf8_unicode_ci

??

8

33

utf8

utf8_general_ci

Yes

1

Note that it is the collation ID, not the character set ID, that is used to identify the unique combination of character set and collation. Thus, when requesting character set information using one of the character set functions in mysys/charset.c, such as get_charset(), different IDs may return the same base character set, but a different collation set.

The following functions provide an internal interface to the collation and character set information, enabling you to access the information by name or ID:

static uint get_collation_number_internal(const char *name)
uint get_collation_number(const char *name)
uint get_charset_number(const char *charset_name, uint cs_flags)
const char *get_charset_name(uint charset_number)
static CHARSET_INFO *get_internal_charset(uint cs_number, myf flags)
CHARSET_INFO *get_charset(uint cs_number, myf flags)
CHARSET_INFO *get_charset_by_name(const char *cs_name, myf flags)
CHARSET_INFO *get_charset_by_csname(const char *cs_name,
                                    uint cs_flags,
                                    myf flags)

The table below details the functions, the key argument that is supplied, and the return value.

Function

Supplied Argument

Return Value

get_collation_number_internal()

Collation name

Collation ID

get_collation_number()

Collation name

Collation ID

get_charset_number()

Character set name

Collation ID

get_charset_name()

Collation ID

Character set name

get_internal_charset()

Collation ID

Character datatype

get_charset()

Collation ID

Character datatype


User Comments
Sign Up Login You must be logged in to post a comment.