MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
A Tale of UDFs with Character Sets

If you are reading this blog post, you likely already know the purpose of user-defined functions (UDF) in MySQL. If you need a quick refresher on UDF’s take a look at the MySQL reference manual.  You may have created your own UDFs as well. Have you ever run into character set related issues with your UDF?  If yes, then this blog post will interest you. If you are planning to write new UDFs then it will be good to know this information. The initial design of MySQL UDF framework did not consider the character set of string arguments and return values. That would typically mean you would get your UDF string arguments and results in so-called “binary” character set. Even if you returned a string in a well-defined character set server would ignore the character set. We now have added the capability to the UDF framework such that you can read or set the character set of the UDF arguments. You can also convert the return value as you like to set its character set.

Let us understand through an example. For the sake of simplicity let us assume that you have the following table that has two columns. One column has character set ‘utf8mb4’ and the other has character set ‘latin1’. The table has one record. The string stored in each column is same. Of course the encoding of both strings is different as per their respective the character set. We could verify the same by checking the HEX and CHARSET of the columns.

Here and elsewhere, you might consider using \G to produce “vertical” result set output to make query output easier to read.

Now suppose that you implemented the following UDF that takes two string arguments and returns a concatenated string of the two arguments. For the sake of simplicity, I have focused on the logic that conveys the idea. I did not add checking for validation and other error conditions.

 

The preceding UDF works well for ASCII characters. What would happen if you pass the two strings in different character sets?

When we concatenate both columns through the UDF, not surprisingly, it just concatenates the bytes in which both strings are represented in their respective character sets. The character set of the return value is ‘binary’, so the return value is not meaningful, as shown below.

In MySQL 8.0.19, we have added a component service ‘mysql_udf_metadata’ to both detect the character set of the input arguments, and to select the desired character set of the UDF output. Let us change the previous implementation using the new component service.  Now the UDF expects both of the arguments in the latin1 character set and returns the concatenated string in the utf8mb4 character set. Here is the modified version of the previous UDF.

Let us execute the UDF again on the same table as previously.  Notice that we passed the first argument in utf8mb4 encoding and second argument in latin1. The UDF is able to handle the character set of both arguments. It returns the concatenated string as a well-formatted ‘utf8mb4’ encoded string.

In the convert() method shown previously, the existing ‘mysql_string_converter’ component service converts the string from one character set (latin1) to another character set (utf8mb4).
This method is optional. It is required only if we want the return value to be in a character set other than the character set of the argument. If instead we want the return value to be in the same character set as the arguments in the example (that is, latin1), that is easily done: Set the desired character set of the arguments and return value in the add_strings_init() method, then append the strings as usual in the add_strings() method.

As you have seen it is easy to new character set aware UDFs. You may also easily upgrade existing UDFs as well.

Refer to the following source directories that implements the UDFs via component and plugin.

Refer to the following MTR tests that tests above component and plugin.

Refer to WL#12370 if you want more details about the new capability of handling character sets in the UDF arguments and return values. We hope you find this capability helpful. Try it out and do let us know your feedback. We are listening.

Thank you for using MySQL!