MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
MySQL 8.0 Collations: Migrating from older collations

From MySQL 8.0, utf8mb4 is the default character set, and the default collation for utf8mb4 is utf8mb4_0900_ai_ci. MySQL 8.0 is also coming with a whole new set of Unicode collations for the utf8mb4 character set.

This will allow use of the complete Unicode 9.0.0 character set in MySQL, and for new applications this is great news. But what about old applications? How painful will it be to modernize the character set usage for an existing application based on 5.7?

Introduction

First we need to understand how MySQL works with respect to character sets and collations:

Each schema (also known as a “database” in MySQL), has a default character set and a default collation. So all tables created in that schema will by default inherit the character set and collation from the schema.

To change this, run:

Each table has also a default character set and a default collation which is by default inherited by the columns in the table.  To change the default character set of a table run:

The change of the default character set/collation of a schema/table will not alter any existing columns or data. These two facts imply that if you upgrade from 5.7 to 8.0, and after the upgrade create no new schemata, MySQL will behave as before with regards to character sets and collations.

In addition, you need to know that for each character set in MySQL there is a default collation, and each collation implies a character set.

This means that in:

col1 get the collation utf8mb4_0900_ai_ci, and col2 get the character set utf8mb4.

So, to take advantage of the new collations for existing applications that have upgraded from 5.7, one need to do a migration project.

First, you need to analyze your schemata and figure out which columns should be in utf8mb4. Then you need to have a look at some properties of these columns.

Keys

InnoDB tables have a max key length of 3072 bytes. This value is dependent upon InnoDB row format and innodb_page_size, so you will need to check the documentation to figure out what your limit is if you don’t run with default settings.

A utf8mb4 character use 1-4 bytes, which means that the maximum length of a char/varchar column that is a key, will be 767 characters. So if you have key varchar/char columns with lengths larger than 767 characters you will have to consider either to shorten the length, change to TEXT or change the InnoDB settings.

Indices

When collation of a column changes, even if the character set stays the same, indices will be regenerated. For large data volumes, this may take considerable time and the table will be locked during index generation.

Unique constraints

Since collations are different, changing the character set of a table may break unique constraints (UNIQUE or PRIMARY KEY etc.). The simplest example is if you have create a character column with default settings in MySQL 5.7, this column will have the character set latin1 which again has the default collation latin1_swedish_ci. In this collation, ‘a’ != ‘å’, ‘a’ != ‘ä’ and ‘o’ != ‘ö’. If this column is changed to utf8mb4 which has the default collation utf8mb4_0900_ai_ci in MySQL 8.0, unique constraint may be broken, since in this collation, ‘a’=’å’=’ä’ and ‘o’=ö’. In this situation, you must consider which collation you really want. For Swedish data, an alternative may be utf8mb4_sv_0900_ai_ci, which share many properties with latin1_swedish_ci. But the correct solution will depend on your actual data set and the needs of your application.

To check whether a column v with a unique constraint in a table t is convertible to utf8mb4 with some collation, you may do:

If the result is larger than zero, you have to make changes to your data or pick a more suitable collation for your data.

To check which value pair cause problems, you may do:

NO PAD collations

The new utf8mb4 Unicode 9.0.0 collations are NO PAD collations (as opposed to the old collations which are PAD SPACE collations). This means that trailing spaces in strings will be significant for VARCHAR columns. Check out which MySQL collations are PAD SPACE and which are NO PAD by doing:

This may have implications depending on your application and the stored data.

Small example

Let us assume that we have the following table after an upgrade from 5.7:

If we do our checks on this table, we get:

And if you just try to convert the column, you get as expected:

There are several possible ways from here:

  1. You can assume that this is Swedish data, and convert to utf8mb4_sv_0900_ai_ci (Since the original had latin1_swedish_ci) or another suitable collation, depending on the application and the data.
  2. It may be that there are data that are incorrect (the table should not have one entry for ‘Örebro’ and another for ‘Orebro’). In that case you need to clean up you data before you convert.

We will go for alternative and 2 and clean up the data.  Then we are ready to convert, or?

Well, that was the key length that hit us. But cities with that long names are not that common, so in our case, we cut down the length of the name to the maximum 768 characters which is well above what we need.

Now that the table is ready for data that is not Latin-1, so can insert data for 東京 (Tokyo):

Conclusion

This blog is by no means an exhaustive manual for migration, and the examples may not scale for huge data volumes, but I hope I have shed some light on issues you may get entangled in when you migrate to utf8mb4.

Please also check out our earlier posts outlining our improvements to utf8 support in MySQL 8.0, starting with our Introduction post on Sushi = Beer.