UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. Currently, MySQL support for UTF-8 does not include 4-byte sequences. (An older standard for UTF-8 encoding, RFC 2279, describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a 2-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use 3-byte sequences.
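The length classes listed above can be checked with a short Python sketch. It is only an illustration using Python's built-in UTF-8 encoder, not MySQL itself; the sample characters and the helper name `utf8_length` are chosen here for demonstration.

```python
# Illustrative sample characters, one or more per length class described above.
samples = {
    "A": "Basic Latin letter",
    "\u00e9": "Latin letter with acute accent",
    "\u0416": "Cyrillic letter",
    "\u05d0": "Hebrew letter",
    "\u4e2d": "CJK ideograph",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}): {utf8_length(ch)} byte(s)")
```

Characters outside the Basic Multilingual Plane, such as U+1F600, encode to four bytes; these are the 4-byte sequences that the text above says are not currently supported.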
Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.
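The arithmetic behind this tip can be sketched in Python. This is a rough model, not a description of MySQL internals: the constant and both helper names are hypothetical, and the variable-length figure counts only the encoded character data, ignoring any length prefix or row overhead.

```python
# Assumption taken from the text: a utf8 character needs at most three bytes.
MAX_BYTES_PER_CHAR = 3


def char_reserved_bytes(declared_length: int) -> int:
    """Worst-case bytes reserved for a fixed-length CHAR(n) utf8 column."""
    return declared_length * MAX_BYTES_PER_CHAR


def encoded_bytes(value: str) -> int:
    """Bytes of UTF-8 data actually needed for a value (no prefix counted)."""
    return len(value.encode("utf-8"))


value = "hello"
print("CHAR(10) reservation:", char_reserved_bytes(10))  # 10 * 3
print("Encoded data for value:", encoded_bytes(value))
```

For a mostly ASCII value, the encoded data is much smaller than the fixed reservation, which is why a variable-length column tends to save space.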
For additional information about data type storage, see Section 11.7, “Data Type Storage Requirements”. For information about InnoDB physical row storage, including how InnoDB tables that use COMPACT row format handle UTF-8 internally, see Section 184.108.40.206, “Physical Row Structure”.