MySQL :: Debugging Character Set Issues by Example

In a world moving towards Unicode and UTF-8, a lot of applications still use some one-byte character set. And since one-byte characters usually accepts any byte in the range 0x00-0xFF it often works well to store and retrieve any data in such character strings, e.g. UTF-8 encoded data. At least as long as you don’t expect any semantics with regards to ordering and comparisons that will not cause any apparent problems.

The MySQL latin1 character set is such a one-byte character set for Western Europe, and it is the default character set of MySQL up to and including 5.7. In spite of the name, the character set is actually Windows-1252 compliant, which is a superset of ISO-8859-1, also known as Latin-1. Correctly set up, MySQL client and server will convert characters from the environment to the correct encoding when the data is sent to the server and vice versa. But if you get your settings wrong, this character set will allow any byte value and this may cause any gibberish to be accepted as valid data without any error or warning.

Example:

mysql> create table foo (i int, v varchar(10) charset latin1);
Query OK, 0 rows affected (0.01 sec)

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into foo values(1, 'é');
Query OK, 1 row affected (0.00 sec)

mysql> select * from foo;
+------+------+
| i    | v    |
+------+------+
| 1    | é    |
+------+------+
1 row in set (0.00 sec)

mysql> create table foo (i int, v varchar(10) charset latin1);

Query OK, 0 rows affected (0.01 sec)

mysql> set names latin1;

Query OK, 0 rows affected (0.00 sec)

mysql> insert into foo values(1, 'é');

Query OK, 1 row affected (0.00 sec)

mysql> select * from foo;

+------+------+

| i | v |

+------+------+

| 1 | é |

+------+------+

1 row in set (0.00 sec)

This looks completely ok, but it isn’t. Since I did set names latin1; and I did this in an UTF-8 environment I had an discrepancy between my environment and my MySQL settings. This is easily revealed by the following query:

mysql> select i,v,hex(v) from foo;
+------+------+--------+
| i    | v    | hex(v) |
+------+------+--------+
| 1    | é    | C3A9   |
+------+------+--------+
1 row in set (0.00 sec)

mysql> select i,v,hex(v) from foo;

+------+------+--------+

| i | v | hex(v) |

+------+------+--------+

| 1 | é | C3A9 |

+------+------+--------+

1 row in set (0.00 sec)

0xC3A9 is not the latin1 encoding of ‘é’, it is the UTF-8 encoding of ‘é’. The correct value is 0xE9. The problem comes from the fact that when I in my environment types ‘é’, the bytes 0xC3A9 are generated and since I have insisted on this being latin1 (set names latin1;) it is interpreted as a perfect valid latin1 character sequence of two characters: 0xC3 and 0xA9.

Now, if we do the correct setting for my environment and try again:

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into foo values(2, 'é');
Query OK, 1 row affected (0.00 sec)

mysql> select i,v,hex(v) from foo;
+------+------+--------+
| i    | v    | hex(v) |
+------+------+--------+
| 1    | Ã©   | C3A9   |
| 2    | é    | E9     |
+------+------+--------+
2 rows in set (0.00 sec)

mysql> set names utf8mb4;

Query OK, 0 rows affected (0.00 sec)

mysql> insert into foo values(2, 'é');

Query OK, 1 row affected (0.00 sec)

mysql> select i,v,hex(v) from foo;

+------+------+--------+

| i | v | hex(v) |

+------+------+--------+

| 1 | Ã© | C3A9 |

| 2 | é | E9 |

+------+------+--------+

2 rows in set (0.00 sec)

As we see, in the second row, my environment’s two UTF-8 bytes for ‘é’ have been correctly converted to the latin1 encoded value 0xE9, while the first row I inserted we can see is garbage.

Fixing it

If you have a case where this is the case consistently over all the rows, it is easily fixable. Given this situation when column v is latin1:

mysql> select i,v,hex(v) from foo;
+------+------+--------+
| i    | v    | hex(v) |
+------+------+--------+
|    1 | Ã©   | C3A9   |
+------+------+--------+
1 row in set (0.00 sec)

mysql> select i,v,hex(v) from foo;

+------+------+--------+

| i | v | hex(v) |

+------+------+--------+

| 1 | Ã© | C3A9 |

+------+------+--------+

1 row in set (0.00 sec)

You may convert this to proper correct latin1 encoding by converting it to the binary character set, then to what the data is (utf8mb4) and the to the desired character set (latin1):

mysql> update foo set v=convert(convert(convert(v using binary) using utf8mb4) using latin1) ;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> select i,v,hex(v) from foo;
+------+------+--------+
| i    | v    | hex(v) |
+------+------+--------+
| 1    | é    | E9     |
+------+------+--------+
1 row in set (0.00 sec)

mysql> update foo set v=convert(convert(convert(v using binary) using utf8mb4) using latin1) ;

Query OK, 1 row affected (0.00 sec)

Rows matched: 1 Changed: 1 Warnings: 0

mysql> select i,v,hex(v) from foo;

+------+------+--------+

| i | v | hex(v) |

+------+------+--------+

| 1 | é | E9 |

+------+------+--------+

1 row in set (0.00 sec)

Of course, if you have a mix of encodings, this may still be done, but on a row-by-row basis and it may turn out to be a very tedious process.

Warning: The above update will succeed, even if there are non-UTF-8 data in the table and also if there are UTF-8 encoded characters that do not map to latin1:

Characters that do not map to latin1 wil be converted to ‘?’
non-UTF-8 data will be converted to NULL and cause a warning

The above may only be used when you wish to stick to latin1 and your data is in the range of latin1 characters. But, It might be that you also have data that is not in the latin1 character set, e.g. Chinese, Japanese or Cyrillic letters. Then you may alter the table and change the character set of the column to an appropriate character set – utf8mb4 – without altering the data:

mysql> select i,v,hex(v) from foo;
+------+--------+--------+
| i    | v      | hex(v) |
+------+--------+--------+
| 1    | Ã©     | C3A9   |
| 2    | Ð”     | D094   |
| 3    | ã°     | E381B0 |
+------+--------+--------+
3 rows in set (0.00 sec)

mysql> alter table foo modify column v varchar(10) charset binary; 
Query OK, 3 rows affected (0.02 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> alter table foo modify column v varchar(10) charset utf8mb4; 
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> select i,v,hex(v) from foo;
+------+------+--------+
| i    | v    | hex(v) |
+------+------+--------+
| 1    | é    | C3A9   |
| 2    | Д    | D094   |
| 3    | ば   | E381B0 |
+------+------+--------+
3 rows in set (0.00 sec)

mysql> select i,v,hex(v) from foo;

+------+--------+--------+

| i | v | hex(v) |

+------+--------+--------+

| 2 | Ð” | D094 |

| 3 | ã° | E381B0 |

+------+--------+--------+

3 rows in set (0.00 sec)

mysql> alter table foo modify column v varchar(10) charset binary;

Query OK, 3 rows affected (0.02 sec)

Records: 3 Duplicates: 0 Warnings: 0

mysql> alter table foo modify column v varchar(10) charset utf8mb4;

Query OK, 3 rows affected (0.03 sec)

Records: 3 Duplicates: 0 Warnings: 0

mysql> select i,v,hex(v) from foo;

+------+------+--------+

| i | v | hex(v) |

+------+------+--------+

| 1 | é | C3A9 |

| 2 | Д | D094 |

| 3 | ば | E381B0 |

+------+------+--------+

3 rows in set (0.00 sec)

As we see, we do that by altering the character set of column v two times: First to binary, and the to the desired character set utf8mb4.

Conclusion

MySQL 8.0 has utf8mb4 as default character set and is as such a much better fit to the modern world of Unicode characters, but as we know, there is a lot of legacy stuff out in the real world. Before you start migrating your data to MySQL 8.0 and Unicode, be sure that your environment and your client agree on the character set in use. It may not be correct even if your application get the same data from the database as you put into it. And if you then in this situation migrate your data to the new utf8mb4 collations, you might really run into trouble, so you will need to fix your data first.