WL#12370: Extend the UDF API to handle properly character sets of the arguments and the return value
Affects: Server-8.0
—
Status: Complete
Currently the UDF API does not handle character sets properly. There are two problems:

1. A string UDF is expected to return a "char *" encoded in the character set of the return value (as declared in CREATE FUNCTION ... SONAME), but the UDF does not know what the character set of the return value is.
2. String arguments are passed as a "char *" encoded in the character set perceived by the server, but the UDF does not know what the character set of each argument is.

So currently there is no guarantee that anything but US-ASCII can be handled reliably and predictably as a string argument.

This worklog addresses the two problems above as follows:

1. The UDF should be able to specify the character set of the returned "char *". The UDF should be able to convert the return value into the specified character set, and may use the string component services to do so.
2. The UDF should be able to know the character set name of each argument, as determined by the server.
Functional Requirements :
-------------------------
Legend :
  (a) User : UDF author who wrote the UDF(s).

FR01 : User must be able to read the following extension attributes of the UDF arguments as well as of the return value.
  (a) charset :- Character set name
  (b) collation :- Collation name

FR02 : User must be able to set the following extension attributes of the UDF arguments as well as of the return value.
  (a) charset :- Character set name
  (b) collation :- Collation name

FR02.1 : User must set only extension attribute values that are supported by the server. For instance:
  (1) charset name : utf8mb4, cp1250 etc.
  (2) collation name : utf8mb4_general_ci, cp1250_czech_cs etc.
  Note : Execute the SHOW CHARACTER SET; and SHOW COLLATION; statements to check the charset and collation names supported by the server.

FR03 : At UDF preparation time (i.e. when the UDF's init() method is called), if the user specifies a wrong character set name then UDF initialization must fail with an appropriate error.

FR04 : At UDF run time (i.e. when the actual UDF executes), if the character set of any argument is found to be different than the one specified by the user at preparation time, then the UDF framework must convert the argument value into the character set specified by the user at preparation time.

FR05 : User must perform the character set conversion of the return value themselves. User may use the string component services to perform the conversion.

Non-Functional Requirements :
-----------------------------
NFR-01 : It must be possible to extend the extension argument in the UDF_INIT and UDF_ARGS structures to add more capabilities in future.
NFR-02 : Changes proposed in this worklog must not break existing UDFs.
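As an illustration of FR02.1 and FR03, the following is a minimal, self-contained sketch of an init-time check that rejects an unknown character set name. It is not server code: the list `kSupportedCharsets` and the functions `is_supported_charset()` and `init_with_result_charset()` are hypothetical names introduced here, and the "true means failure" return convention is an assumption modelled on the usual UDF init function contract.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string_view>

// Hypothetical stand-in for the server's registry of character sets.
// A real server would consult its own list (see SHOW CHARACTER SET).
constexpr std::array<std::string_view, 3> kSupportedCharsets = {
    "utf8mb4", "latin1", "cp1250"};

// Returns true if `name` is in the hypothetical supported list.
bool is_supported_charset(std::string_view name) {
  for (std::string_view known : kSupportedCharsets) {
    if (known == name) return true;
  }
  return false;
}

// Hypothetical init-time check modelled on FR03.
// Assumption: returns true on failure and writes an error into `message`.
bool init_with_result_charset(std::string_view requested, char *message,
                              std::size_t message_size) {
  assert(message != nullptr && message_size > 0);
  if (!is_supported_charset(requested)) {
    std::snprintf(message, message_size, "Unknown character set '%.*s'",
                  static_cast<int>(requested.size()), requested.data());
    return true;  // initialization fails
  }
  message[0] = '\0';
  return false;  // initialization succeeds
}
```

Failing at preparation time, rather than at run time, means a statement that names an unsupported character set is rejected before any row is processed.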
Legend :
  - user         - UDF author, who writes the UDF(s).
  - udf_init()   - The init method for the UDF, defined by the user.
  - udf()        - The actual UDF function, defined by the user.
  - udf_deinit() - The deinit method, defined by the user.

HS01: Possible solutions
-------------------------
There are two possible approaches:

1. Ensure that an explicit conversion is made at call time from/to the character sets declared in udf_init() for each argument and the return value. Let the udf() deal with the arguments in the specified character sets.
   Pros: Easy for the UDF authors. They just need to specify the character sets of the arguments and the return value.
   Cons: A performance penalty for converting strings to/from the specified character sets.

2. Pass down the character set name of each argument to the UDF and let it decide which character set to convert the argument values from and to.
   Pros: Makes UDFs similar to built-in functions performance-wise, as it avoids the extra transformations and allows UDFs to deal with strings the way the rest of the server does.
   Cons: Increased complexity for UDF authors. They either need to convert the strings themselves or use the likes of the string component service to deal with strings in varying encodings.

┌─────────────────────────────────────────────────────────────────────────────┐
│ This worklog aims at providing behavior similar to approach#1 using best of │
│ both approaches. It will provide the API(s) to the user to specify or read  │
│ the char_set_names of arguments and return value.                           │
│ Server will provide the converted arguments to the udf(). Udf author may    │
│ convert the return value in the specified character set in the udf().       │
└─────────────────────────────────────────────────────────────────────────────┘

HS02 : If we extend the UDF APIs to pass down the information then we could break ABI compatibility. Luckily WL#2872 added an extension argument in the UDF_ARGS and UDF_INIT.
This argument is currently not used, so we can use it to pass the extra information in a reliable way without breaking ABI compatibility. We shall add interface functions for handling it. The UDF author will be able to set or retrieve the values of these arguments using those methods.

HS03: API specifications :
--------------------------
HS03.1 A new server component service "mysql_udf_metadata" will be added. This service will provide the methods related to UDF metadata.

HS03.2 The above component service will provide the following UDF metadata methods in order to get or set the extension attributes of the UDF arguments or the return value.
  (1) argument_get : To retrieve the extension attributes of UDF arguments.
  (2) argument_set : To set the extension attributes of UDF arguments.
  (3) result_get   : To retrieve the extension attributes of the return value.
  (4) result_set   : To set the extension attributes of the return value.

HS03.3 API that retrieves the extension attribute of a UDF argument.

    argument_get(UDF_ARGS *udf_args,         /* Handle of UDF_ARGS structure */
                 const char *extension_type, /* Type of extension attribute to get */
                 unsigned int index,         /* Index of the UDF argument whose
                                                extension attribute is to be fetched */
                 void **out_value);          /* Retrieved value */

One could retrieve the charset of the first UDF argument as follows:

    void *out_value = nullptr;
    const unsigned index = 0;
    my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
        "mysql_udf_metadata", mysql_plugin_registry_acquire());
    service->argument_get(udf_args, "charset", index, &out_value);
    const char *charset_name = static_cast<const char *>(out_value);

HS03.4 API that sets the extension attribute of a UDF argument.

    argument_set(UDF_ARGS *udf_args,         /* Handle of UDF_ARGS structure */
                 const char *extension_type, /* Type of extension attribute to set */
                 unsigned int index,         /* Index of the UDF argument whose
                                                extension attribute is to be set */
                 void *in_value);            /* Value to be set */

One could set the charset of the first UDF argument as follows:

    const char *name = "utf8mb4";
    char *value = const_cast<char *>(name);
    my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
        "mysql_udf_metadata", mysql_plugin_registry_acquire());
    service->argument_set(udf_args, "charset", 0, static_cast<void *>(value));

HS03.5 API that retrieves the extension attribute of a UDF return value.

    result_get(UDF_INIT *udf_init,         /* Handle of UDF_INIT structure */
               const char *extension_type, /* Type of extension attribute to be
                                              retrieved */
               void **out_value);          /* Retrieved value */

One could retrieve the charset of the UDF return value as follows:

    void *out_value = nullptr;
    my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
        "mysql_udf_metadata", mysql_plugin_registry_acquire());
    service->result_get(udf_init, "charset", &out_value);
    const char *charset_name = static_cast<const char *>(out_value);

HS03.6 API that sets the extension attribute of a UDF return value.

    result_set(UDF_INIT *udf_init,         /* Handle of UDF_INIT structure */
               const char *extension_type, /* Type of extension attribute to be set */
               void *in_value);            /* Value to be set */

One could set the charset of the UDF return value as follows:

    const char *name = "utf8mb4";
    char *value = const_cast<char *>(name);
    my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
        "mysql_udf_metadata", mysql_plugin_registry_acquire());
    service->result_set(udf_init, "charset", static_cast<void *>(value));

HS03.7 The argument 'extension_type' in the APIs above is a char* instead of an enum for compatibility reasons. It will be easier to deprecate and remove an extension type without breaking existing UDFs. The above APIs will compare the value of 'extension_type' case-insensitively.

HS03.8 The extension pointer in the UDF_ARGS structure will point to the following structure. We could add members to the structure in case we decide to add more extension arguments to the UDF arguments. This structure is opaque to the UDF users; it is accessible only through the APIs listed above.

    struct Udf_args_extension {
      Udf_args_extension() : charset_info() {}
      const CHARSET_INFO **charset_info;
    };

HS03.9 The extension pointer in the UDF_INIT structure will point to the following structure. We could add members to the structure in case we decide to add more extension arguments to the return value. This structure is opaque to the UDF users; it is accessible only through the APIs listed above.

    struct Udf_return_value_extension {
      Udf_return_value_extension(const CHARSET_INFO *charset_info = nullptr)
          : charset_info(charset_info) {}
      const CHARSET_INFO *charset_info;
    };

HS04: Usage specifications
--------------------------
- The UDF arguments are evaluated right before the udf_init() method is called. That means the user can read the extension arguments of any of the UDF arguments in the udf_init() method by calling the argument_get() API as explained in the previous section.
- If the user wishes to change the default value of an extension argument of any of the UDF arguments, the user can do that by calling the argument_set() API as explained in the previous section.
- If the character set or collation in the extension argument of any of the UDF argument(s) is set to something other than the default value, then the server will provide converted arguments to the udf() at execution time. The argument(s) will be converted into the character set that was set by the user during udf_init().
- The user can retrieve the extension arguments of the return value by calling the result_get() API as explained in the previous section.
- The user can set the extension argument of the return value by calling the result_set() API.
- If the user converts the return value from the default charset/collation to any other charset/collation, then the user must set the changed name in the extension argument.
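The usage flow above can be sketched with a small stand-alone model. This is not the real service: `MockUdfArgs`, `MockUdfInit`, `iequals()` and `my_udf_init()` are hypothetical names, the free functions only mimic the shape of argument_get, argument_set and result_set from HS03, and the "true means error" return convention is an assumption. The sketch shows a udf_init()-style function that asks for every argument and the return value in utf8mb4, with the case-insensitive 'extension_type' matching described in HS03.7.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-ins for UDF_ARGS / UDF_INIT; only the charset
// extension attribute is modelled.
struct MockUdfArgs { std::vector<std::string> arg_charset; };
struct MockUdfInit { std::string result_charset; };

// Case-insensitive ASCII comparison, mirroring HS03.7.
bool iequals(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i])))
      return false;
  }
  return true;
}

// Mock accessors shaped like the HS03 methods.
// Assumption: they return true on error, false on success.
bool argument_get(MockUdfArgs *args, const char *type, unsigned index,
                  void **out_value) {
  assert(args && type && out_value);
  if (!iequals(type, "charset") || index >= args->arg_charset.size())
    return true;
  *out_value = const_cast<char *>(args->arg_charset[index].c_str());
  return false;
}

bool argument_set(MockUdfArgs *args, const char *type, unsigned index,
                  void *in_value) {
  assert(args && type && in_value);
  if (!iequals(type, "charset") || index >= args->arg_charset.size())
    return true;
  args->arg_charset[index] = static_cast<const char *>(in_value);
  return false;
}

bool result_set(MockUdfInit *init, const char *type, void *in_value) {
  assert(init && type && in_value);
  if (!iequals(type, "charset")) return true;
  init->result_charset = static_cast<const char *>(in_value);
  return false;
}

// Hypothetical init function: request utf8mb4 for every argument and for
// the return value. Returns true if any request fails.
bool my_udf_init(MockUdfInit *init, MockUdfArgs *args) {
  char name[] = "utf8mb4";
  for (unsigned i = 0; i < args->arg_charset.size(); ++i) {
    if (argument_set(args, "charset", i, name)) return true;
  }
  return result_set(init, "CharSet", name);  // matched case-insensitively
}
```

In a real UDF the same calls would go through the "mysql_udf_metadata" service handle, as in the HS03 snippets, rather than through free functions.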
Background :
------------
The following details are for information purposes only.
- UDF execution is handled either by Item_udf_func or its derived classes. These classes rely on a concrete class udf_handler that does the actual work. udf_handler::fix_fields() is responsible for executing the udf_init() method.
- The udf_handler::val_<xyz>() method executes the udf() that returns the type xyz. For instance, udf_handler::val_str() executes the udf() that returns a string.
- udf_handler::fix_fields() is deemed to be executed at preparation time, while udf_handler::val_<xyz>() is deemed to be executed at run time.
- udf_handler::get_arguments() is deemed to be executed at run time, right before the udf() is called.

LL01: Likely implementation:
----------------------------
1. At preparation time, Item_udf_func::fix_fields() will calculate the character set and collation of all supplied argument strings.
2. Item_udf_func::fix_fields() further sets the resolved character set of the arguments and the return value to "binary". With this default setting, no conversion will be performed at run time.
3. Item_udf_func::fix_fields() calls the udf_init() function, which may set the "extended" arguments specifying the character set/collation of the argument(s) or the return value.
4. When a UDF function is called, get_arguments() ensures that a provided string is converted to the character set expected by the UDF function.
5. After a UDF function has been called, the return value is expected to be in the character set that was decided earlier. Depending on the expression it is used in, it may be converted further, but this is entirely up to the executor.
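Steps 2 and 4 above can be illustrated with a self-contained sketch of the run-time decision: the "binary" default means the bytes are passed through untouched, otherwise the argument is converted to the character set requested in udf_init(). The functions `latin1_to_utf8()` and `prepare_argument()` are hypothetical and are not the server's get_arguments() implementation. The sketch models only one conversion pair and treats "latin1" as ISO-8859-1 for simplicity; the server would use its own character set converters.

```cpp
#include <cassert>
#include <string>

// Converts ISO-8859-1 bytes to UTF-8. Every ISO-8859-1 byte maps to the
// Unicode code point with the same numeric value.
std::string latin1_to_utf8(const std::string &in) {
  std::string out;
  out.reserve(in.size() * 2);
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));  // ASCII: one byte, unchanged
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
    }
  }
  return out;
}

// Hypothetical model of step 4: produce the bytes handed to the udf().
//   actual_charset   - charset the server resolved for the argument
//   expected_charset - charset requested in udf_init(), or "binary"
std::string prepare_argument(const std::string &value,
                             const std::string &actual_charset,
                             const std::string &expected_charset) {
  // Step 2: with the "binary" default, or matching charsets, no conversion.
  if (expected_charset == "binary" || expected_charset == actual_charset)
    return value;
  if (actual_charset == "latin1" && expected_charset == "utf8mb4")
    return latin1_to_utf8(value);
  // Other conversion pairs are outside the scope of this sketch.
  assert(false && "conversion pair not modelled in this sketch");
  return value;
}
```

Because the default is "binary", a UDF that never touches the extension attributes sees exactly the bytes it saw before, which is how the design stays compatible with existing UDFs (NFR-02).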