In my blog MySQL 8.0 Collations: Migrating from older collations I showed a query that could identify the values that might break a unique constraint when migrate your data. That query was not very efficient due to the self join of the converted values.
Now that MySQL 8.0 RC1 (MySQL 8.0.3) is out we may use Window Functions to identify those columns. The new query is simpler, more elegant and will be much more efficient than the one in the original blog, and it lists all values that have some other value that breaks the unique constraint after a conversion to utf8mb4:
1 2 3 4 5 6 7 |
select name from (select name, count(name) over (partition by convert(name using utf8mb4)) as cnt from cities) as t where cnt > 1; |
How to write queries like this is described in Dag’s blog on Window Functions.
Using the example from the original blog, we get the this result:
1 2 3 4 5 6 7 8 |
mysql> select name from (select name, count(name) over (partition by convert(name using utf8mb4)) as cnt from cities) as t where cnt > 1; +---------+ | name | +---------+ | Orebro | | Örebro | +---------+ 2 rows in set (0.00 sec) |
A synthetic test with nearly 40k strings in a latin1 column with default collation and around 750 possible offending values in utf8mb4 with default collation, run on my desktop computer, show that the original query with the self join used nearly 5 seconds, while the window function version shown here used 0.10 seconds, 50 times faster.
Note that the two queries do not generate the same result, they just solve the same problem.
Thank you for using MySQL !