MySQL :: A Tale of UDFs with Character Sets

If you are reading this blog post, you likely already know the purpose of user-defined functions (UDF) in MySQL. If you need a quick refresher on UDF’s take a look at the MySQL reference manual. You may have created your own UDFs as well. Have you ever run into character set related issues with your UDF? If yes, then this blog post will interest you. If you are planning to write new UDFs then it will be good to know this information. The initial design of MySQL UDF framework did not consider the character set of string arguments and return values. That would typically mean you would get your UDF string arguments and results in so-called “binary” character set. Even if you returned a string in a well-defined character set server would ignore the character set. We now have added the capability to the UDF framework such that you can read or set the character set of the UDF arguments. You can also convert the return value as you like to set its character set.

Let us understand through an example. For the sake of simplicity let us assume that you have the following table that has two columns. One column has character set ‘utf8mb4’ and the other has character set ‘latin1’. The table has one record. The string stored in each column is same. Of course the encoding of both strings is different as per their respective the character set. We could verify the same by checking the HEX and CHARSET of the columns.

mysql> CREATE TABLE test.country (utf varchar(40) CHARACTER SET utf8mb4,
    ->                            lt1 VARCHAR(40) CHARACTER SET latin1);
Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO test.country(utf, lt1) 
    -> values('Österreich',  'Österreich');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT HEX(utf), HEX(lt1), utf, lt1,
    -> CHARSET(utf), CHARSET(lt1)
    -> FROM test.country;

mysql> CREATE TABLE test.country (utf varchar(40) CHARACTER SET utf8mb4,

-> lt1 VARCHAR(40) CHARACTER SET latin1);

Query OK, 0 rows affected (0.06 sec)

mysql> INSERT INTO test.country(utf, lt1)

-> values('Österreich', 'Österreich');

Query OK, 1 row affected (0.01 sec)

mysql> SELECT HEX(utf), HEX(lt1), utf, lt1,

-> CHARSET(utf), CHARSET(lt1)

-> FROM test.country;

Here and elsewhere, you might consider using \G to produce “vertical” result set output to make query output easier to read.

+------------------------+----------------------+-------------+-------------+--------------+--------------+
| HEX(utf)               | HEX(lt1)             | utf         | lt1         | CHARSET(utf) | CHARSET(lt1) |
+------------------------+----------------------+-------------+-------------+--------------+--------------+
| C396737465727265696368 | D6737465727265696368 | Österreich  | Österreich  | utf8mb4      | latin1       |
+------------------------+----------------------+-------------+-------------+--------------+--------------+
1 row in set (0.00 sec)

+------------------------+----------------------+-------------+-------------+--------------+--------------+

+------------------------+----------------------+-------------+-------------+--------------+--------------+

+------------------------+----------------------+-------------+-------------+--------------+--------------+

1 row in set (0.00 sec)

Now suppose that you implemented the following UDF that takes two string arguments and returns a concatenated string of the two arguments. For the sake of simplicity, I have focused on the logic that conveys the idea. I did not add checking for validation and other error conditions.

bool add_strings_init(UDF_INIT *initid, UDF_ARGS *args, char *message) {
  size_t length = 0;
  for (unsigned int i = 0; i < args->arg_count; i++) {
    if (args->arg_type[i] != STRING_RESULT) {
      strcpy(message, "This UDF accepts only string arguments.");
      return true;
    }
    length += args->lengths[i];
  }
  initid->ptr = new char[length+1];
  initid->max_length = length;
  initid->maybe_null = 1;
  return false;
}

char *add_strings(UDF_INIT *initid, UDF_ARGS *args, char *result,
                  unsigned long *length, unsigned char *is_null,
                  unsigned char *error) {
  std::string arg1(args->args[0], args->lengths[0]);
  std::string arg2(args->args[1], args->lengths[1]);
  std::string add  = arg1 + arg2;
  strcpy(initid->ptr, add.c_str());
  *length = add.length();
  result = initid->ptr;
  *is_null = 0;
  *error = 0;
  return result;
}
void add_strings_deinit(UDF_INIT *initid) { delete[] initid->ptr; }

bool add_strings_init(UDF_INIT *initid, UDF_ARGS *args, char *message) {

size_t length = 0;

for (unsigned int i = 0; i < args->arg_count; i++) {

if (args->arg_type[i] != STRING_RESULT) {

strcpy(message, "This UDF accepts only string arguments.");

return true;

}

length += args->lengths[i];

}

initid->ptr = new char[length+1];

initid->max_length = length;

initid->maybe_null = 1;

return false;

}

char *add_strings(UDF_INIT *initid, UDF_ARGS *args, char *result,

unsigned long *length, unsigned char *is_null,

unsigned char *error) {

std::string arg1(args->args[0], args->lengths[0]);

std::string arg2(args->args[1], args->lengths[1]);

std::string add = arg1 + arg2;

strcpy(initid->ptr, add.c_str());

*length = add.length();

result = initid->ptr;

*is_null = 0;

*error = 0;

return result;

}

void add_strings_deinit(UDF_INIT *initid) { delete[] initid->ptr; }

The preceding UDF works well for ASCII characters. What would happen if you pass the two strings in different character sets?

When we concatenate both columns through the UDF, not surprisingly, it just concatenates the bytes in which both strings are represented in their respective character sets. The character set of the return value is ‘binary’, so the return value is not meaningful, as shown below.

mysql> SELECT add_strings(utf,lt1) AS string,
    -> HEX(add_strings(utf, lt1)) AS hex,
    -> CHARSET(add_strings(utf,lt1)) AS charset
    -> FROM test.country;

+-----------------------+--------------------------------------------+---------+
| string                | hex                                        | charset |
+-----------------------+--------------------------------------------+---------+
| Österreich▒sterreich   | C396737465727265696368D6737465727265696368 | binary  |
+-----------------------+--------------------------------------------+---------+
1 row in set (0.00 sec)

mysql> SELECT add_strings(utf,lt1) AS string,

-> HEX(add_strings(utf, lt1)) AS hex,

-> CHARSET(add_strings(utf,lt1)) AS charset

-> FROM test.country;

+-----------------------+--------------------------------------------+---------+

| string | hex | charset |

+-----------------------+--------------------------------------------+---------+

| Österreich▒sterreich | C396737465727265696368D6737465727265696368 | binary |

+-----------------------+--------------------------------------------+---------+

1 row in set (0.00 sec)

In MySQL 8.0.19, we have added a component service ‘mysql_udf_metadata’ to both detect the character set of the input arguments, and to select the desired character set of the UDF output. Let us change the previous implementation using the new component service. Now the UDF expects both of the arguments in the latin1 character set and returns the concatenated string in the utf8mb4 character set. Here is the modified version of the previous UDF.

#include <mysql/components/component_implementation.h>
#include <mysql/components/services/mysql_string.h>
#include <mysql/components/services/udf_metadata.h>

extern REQUIRES_SERVICE_PLACEHOLDER(mysql_udf_metadata);
extern REQUIRES_SERVICE_PLACEHOLDER(mysql_string_converter);
extern REQUIRES_SERVICE_PLACEHOLDER(mysql_string_factory);

/*
  Utility method to convert input buffer from one character set to  
  another using mysql_string_converter service.
*/
bool convert(const std::string &out_charset_name,
             const std::string &in_charset_name, const std::string 
             &in_buffer, size_t out_buffer_length, char *out_buffer) {
  my_h_string out_string = nullptr;
  if (mysql_service_mysql_string_factory->create(&out_string)) {
    return true;
  } else {
    mysql_service_mysql_string_factory->destroy(out_string);
    if (mysql_service_mysql_string_converter->convert_from_buffer(
            &out_string, in_buffer.c_str(), in_buffer.length(),
            in_charset_name.c_str())) {
      mysql_service_mysql_string_factory->destroy(out_string);
      return true;
    }
    if (mysql_service_mysql_string_converter->convert_to_buffer(
            out_string, out_buffer, out_buffer_length,
            out_charset_name.c_str())) {
      mysql_service_mysql_string_factory->destroy(out_string);
      return true;
    }
  }
  mysql_service_mysql_string_factory->destroy(out_string);
  return false;
}

char *latin1 = const_cast<char *>("latin1");
char *utf8mb4 = const_cast<char *>("utf8mb4");

bool add_strings_init(UDF_INIT *initid, UDF_ARGS *args, char *message) {
  size_t length = 0;
  for (unsigned int i = 0; i < args->arg_count; i++) {
    if (args->arg_type[i] != STRING_RESULT) {
      strcpy(message, "This UDF accepts only string arguments.");
      return true;
    }
    length += args->lengths[i];
    /* Inform the server to provide the arguments in the latin1 charset */
    if (mysql_service_mysql_udf_metadata->argument_set(
            args, "charset", i, static_cast<void *>(latin1))) {
      return true;
    }
  }
  /* Inform the server that charset of  return value will be utf8mb4 */
  if (mysql_service_mysql_udf_metadata->result_set(initid, "charset", utf8mb4))
    return true;
  length = length * 4 + 1;
  initid->ptr = new char[length];
  initid->max_length = length;
  initid->maybe_null = 1;
  return false;
}

char *add_strings(UDF_INIT *initid, UDF_ARGS *args, char *result,
                  unsigned long *length, unsigned char *is_null,
                  unsigned char *error) {
  std::string arg1(args->args[0], args->lengths[0]);
  std::string arg2(args->args[1], args->lengths[1]);
  /* Convert the first argument value from latin1 to utf8mb4 charset */
  if (convert(utf8mb4, latin1, arg1, initid->max_length, initid->ptr)) {
    *is_null = 1;
    *error = 1;
    return nullptr;
  }
  std::string final_str(initid->ptr);
  /* Convert the second argument value from latin1 to utf8mb4 charset */
  if (convert(utf8mb4, latin1, arg2, initid->max_length, initid->ptr)) {
    *is_null = 1;
    *error = 1;
    return nullptr;
  }

  final_str.append(initid->ptr);
  strcpy(initid->ptr, final_str.c_str());
  *length = final_str.length();
  result = initid->ptr;
  *is_null = 0;
  *error = 0;
  return result;
}
void add_strings_deinit(UDF_INIT *initid) { delete[] initid->ptr; }

#include <mysql/components/component_implementation.h>

#include <mysql/components/services/mysql_string.h>

#include <mysql/components/services/udf_metadata.h>

extern REQUIRES_SERVICE_PLACEHOLDER(mysql_udf_metadata);

extern REQUIRES_SERVICE_PLACEHOLDER(mysql_string_converter);

extern REQUIRES_SERVICE_PLACEHOLDER(mysql_string_factory);

Utility method to convert input buffer from one character set to

another using mysql_string_converter service.

bool convert(const std::string &out_charset_name,

const std::string &in_charset_name, const std::string

&in_buffer, size_t out_buffer_length, char *out_buffer) {

my_h_string out_string = nullptr;

if (mysql_service_mysql_string_factory->create(&out_string)) {

return true;

} else {

mysql_service_mysql_string_factory->destroy(out_string);

if (mysql_service_mysql_string_converter->convert_from_buffer(

&out_string, in_buffer.c_str(), in_buffer.length(),

in_charset_name.c_str())) {

mysql_service_mysql_string_factory->destroy(out_string);

return true;

}

if (mysql_service_mysql_string_converter->convert_to_buffer(

out_string, out_buffer, out_buffer_length,

out_charset_name.c_str())) {

mysql_service_mysql_string_factory->destroy(out_string);

return true;

}

mysql_service_mysql_string_factory->destroy(out_string);

return false;

}

char *latin1 = const_cast<char *>("latin1");

char *utf8mb4 = const_cast<char *>("utf8mb4");

bool add_strings_init(UDF_INIT *initid, UDF_ARGS *args, char *message) {

size_t length = 0;

for (unsigned int i = 0; i < args->arg_count; i++) {

if (args->arg_type[i] != STRING_RESULT) {

strcpy(message, "This UDF accepts only string arguments.");

return true;

}

length += args->lengths[i];

/* Inform the server to provide the arguments in the latin1 charset */

if (mysql_service_mysql_udf_metadata->argument_set(

args, "charset", i, static_cast<void *>(latin1))) {

return true;

}

/* Inform the server that charset of return value will be utf8mb4 */

if (mysql_service_mysql_udf_metadata->result_set(initid, "charset", utf8mb4))

return true;

length = length * 4 + 1;

initid->ptr = new char[length];

initid->max_length = length;

initid->maybe_null = 1;

return false;

}

char *add_strings(UDF_INIT *initid, UDF_ARGS *args, char *result,

unsigned long *length, unsigned char *is_null,

unsigned char *error) {

std::string arg1(args->args[0], args->lengths[0]);

std::string arg2(args->args[1], args->lengths[1]);

/* Convert the first argument value from latin1 to utf8mb4 charset */

if (convert(utf8mb4, latin1, arg1, initid->max_length, initid->ptr)) {

*is_null = 1;

*error = 1;

return nullptr;

}

std::string final_str(initid->ptr);

/* Convert the second argument value from latin1 to utf8mb4 charset */

if (convert(utf8mb4, latin1, arg2, initid->max_length, initid->ptr)) {

*is_null = 1;

*error = 1;

return nullptr;

}

final_str.append(initid->ptr);

strcpy(initid->ptr, final_str.c_str());

*length = final_str.length();

result = initid->ptr;

*is_null = 0;

*error = 0;

return result;

}

void add_strings_deinit(UDF_INIT *initid) { delete[] initid->ptr; }

Let us execute the UDF again on the same table as previously. Notice that we passed the first argument in utf8mb4 encoding and second argument in latin1. The UDF is able to handle the character set of both arguments. It returns the concatenated string as a well-formatted ‘utf8mb4’ encoded string.

mysql> select add_strings(utf,lt1) as string, 
    -> HEX(add_strings(utf, lt1)) as hex,
    -> CHARSET(add_strings(utf,lt1)) as charset from 
    -> test.country;
+------------------------+----------------------------------------------+---------+
| string                 | hex                                          | charset |
+------------------------+----------------------------------------------+---------+
| ÖsterreichÖsterreich   | C396737465727265696368C396737465727265696368 | utf8mb4 |
+------------------------+----------------------------------------------+---------+
1 row in set (0.00 sec)

mysql> select add_strings(utf,lt1) as string,

-> HEX(add_strings(utf, lt1)) as hex,

-> CHARSET(add_strings(utf,lt1)) as charset from

-> test.country;

+------------------------+----------------------------------------------+---------+

| string | hex | charset |

+------------------------+----------------------------------------------+---------+

| ÖsterreichÖsterreich | C396737465727265696368C396737465727265696368 | utf8mb4 |

+------------------------+----------------------------------------------+---------+

1 row in set (0.00 sec)

In the convert() method shown previously, the existing ‘mysql_string_converter’ component service converts the string from one character set (latin1) to another character set (utf8mb4).
This method is optional. It is required only if we want the return value to be in a character set other than the character set of the argument. If instead we want the return value to be in the same character set as the arguments in the example (that is, latin1), that is easily done: Set the desired character set of the arguments and return value in the add_strings_init() method, then append the strings as usual in the add_strings() method.

As you have seen it is easy to new character set aware UDFs. You may also easily upgrade existing UDFs as well.

Refer to the following source directories that implements the UDFs via component and plugin.

<source>\components\test\udf_services\
<source>\plugin\udf_services\

1 2	<source>\components\test\udf_services\ <source>\plugin\udf_services\

Refer to the following MTR tests that tests above component and plugin.

<source>\mysql-test\suite\test_services\t\test_udf_extension_services_plugin.test
<source>\mysql-test\suite\test_services\t\test_udf_extension_services_component.test

1 2	<source>\mysql-test\suite\test_services\t\test_udf_extension_services_plugin.test <source>\mysql-test\suite\test_services\t\test_udf_extension_services_component.test

Refer to WL#12370 if you want more details about the new capability of handling character sets in the UDF arguments and return values. We hope you find this capability helpful. Try it out and do let us know your feedback. We are listening.

Thank you for using MySQL!