MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
Debugging Character Set Issues by Example

In a world moving towards Unicode and UTF-8, a lot of applications still use some one-byte character set. And since one-byte characters usually accepts any byte in the range 0x00-0xFF it often works well to store and retrieve any data in such character strings, e.g. UTF-8 encoded data. At least as long as you don’t expect any semantics with regards to ordering and comparisons that will not cause any apparent problems.

The MySQL latin1 character set is such a one-byte character set for Western Europe, and it is the default character set of MySQL up to and including 5.7. In spite of the name, the character set is actually Windows-1252 compliant, which is a superset of ISO-8859-1, also known as Latin-1.  Correctly set up, MySQL client and server will convert characters from the environment to the correct encoding when the data is sent to the server and vice versa. But if you get your settings wrong, this character set will allow any byte value and  this may cause any gibberish to be accepted as valid data without any error or warning.

Example:

This looks completely ok, but it isn’t. Since I did set names latin1; and I did this in an UTF-8 environment I had an discrepancy between my environment and my MySQL settings. This is easily revealed by the following query:

0xC3A9 is not the latin1 encoding of ‘é’, it is the UTF-8 encoding of ‘é’. The correct value is 0xE9. The problem comes from the fact that when I in my environment types ‘é’, the bytes 0xC3A9 are generated and since I have insisted on this being latin1 (set names latin1;) it is interpreted as a perfect valid latin1 character sequence of two characters: 0xC3 and 0xA9.

Now, if we do the correct setting for my environment and try again:

As we see, in the second row, my environment’s two UTF-8 bytes for ‘é’ have been correctly converted to the latin1 encoded value 0xE9, while the first row I inserted we can see is garbage.

Fixing it

If you have a case where this is the case consistently over all the rows, it is easily  fixable. Given this situation when column v is latin1:

You may convert this to proper correct latin1 encoding by converting it to the binary character set, then to what the data is (utf8mb4) and the to the desired character set (latin1):

Of course, if you have a mix of encodings, this may still be done, but on a row-by-row basis and it may turn out to be a very tedious process.

Warning: The above update will succeed, even if there are non-UTF-8 data in the table and also if there are UTF-8 encoded characters that do not map to latin1:

  • Characters that do not map to latin1 wil be converted to ‘?’
  • non-UTF-8 data will be converted to NULL and cause a warning

The above may only be used when you wish to stick to latin1 and your data is in the range of latin1 characters. But, It might be that you also have data that is not in the latin1 character set, e.g. Chinese, Japanese or Cyrillic letters. Then you may alter the table and change the character set of the column to an appropriate character set – utf8mb4 – without altering the data:

As we see, we do that by altering the character set of column v two times: First to binary, and the to the desired character set utf8mb4.

Conclusion

MySQL 8.0 has utf8mb4 as default character set and is as such a much better fit to the modern world of Unicode characters, but as we know, there is a lot of legacy stuff out in the real world. Before you start migrating your data to MySQL 8.0 and Unicode, be sure that your environment and your client agree on the character set in use. It may not be correct even if your application get the same data from the database as you put into it. And if you then in this situation migrate your data to the new utf8mb4 collations, you might really run into trouble, so you will need to fix your data first.