WL#12370: Extend the UDF API to handle properly character sets of the arguments and the return value

Affects: Server-8.0   —   Status: Complete

Currently the UDF API does not handle character sets properly.

There are two problems:

1. A string UDF is expected to return a "char *" encoded in the character
   set of the return value (as declared in CREATE FUNCTION ... SONAME).
   But the UDF does not know what the character set of the return value is.

2. String arguments are passed as a "char *" encoded in the character set
   perceived by the server, but the UDF does not know what the character set
   of each argument is. So currently there is no guarantee that anything but
   US-ASCII can be handled reliably and predictably as a string argument.

The aim of this worklog is to address these two problems as follows:

1. The UDF should be able to specify the character set of the returned
   "char *". The UDF should be able to convert the return value into the
   specified character set. It may use the string component services for
   that conversion.

2. The UDF should be able to know the character set name of each argument as
   determined by the server.

Functional Requirements :
-------------------------
Legend :
  (a) User : UDF author who wrote the UDF(s).

FR01 : User must be able to read the following extension attributes of UDF
       arguments as well as of the return value.
       (a) charset :- Character set name
       (b) collation :- Collation name

FR02 : User must be able to set the following extension attributes of UDF
       arguments as well as of the return value.
       (a) charset :- Character set name
       (b) collation :- Collation name

       FR02.1 : User must set only extension attribute values which are
                supported by the server.
                For instance:
                (1) charset name : utf8mb4, cp1250 etc.
                (2) collation name : utf8mb4_general_ci, cp1250_czech_cs etc.

       Note : Execute the SHOW CHARACTER SET; and SHOW COLLATION; statements
              to check the charset and collation names supported by the
              server.

FR03 : At UDF preparation time (i.e. when the UDF's init() method is called),
       if the user specifies an unsupported character set name, then UDF
       initialization must fail with an appropriate error.

FR04 : At UDF run time (i.e. when the actual UDF executes), if the character
       set of any argument differs from the one specified by the user at
       preparation time, then the UDF framework must convert the argument
       value into the character set specified by the user at preparation
       time.

FR05 : User must perform the character set conversion of the return value on
       their own. User may use the string component services to perform the
       conversion.
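
As an illustration of FR03, the following is a minimal, self-contained sketch
of an init-time check. It does not use the real server API: the names
is_supported_charset() and check_requested_charset(), the hard-coded list of
character sets and the error text are all assumptions made for this example.
The only convention borrowed from real UDFs is that an init function reports
failure by returning true and writing a message into a caller-provided buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the server's list of supported character sets.
static bool is_supported_charset(const std::string &name) {
  static const std::unordered_set<std::string> known = {
      "binary", "latin1", "cp1250", "utf8mb4"};
  return known.count(name) != 0;
}

// Hypothetical init-time check for FR03. Returns true on failure and writes
// a description into `message`, the way a UDF init function reports errors.
static bool check_requested_charset(const char *requested, char *message,
                                    std::size_t message_size) {
  assert(requested != nullptr && message != nullptr && message_size > 0);
  if (is_supported_charset(requested)) return false;
  std::snprintf(message, message_size, "Unknown character set: '%s'",
                requested);
  return true;
}
```

In a real UDF the check is performed by the server when the attribute is set;
the sketch only shows the expected outcome, namely that a misspelled name is
rejected before any row is processed.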

Non-Functional Requirements :
-----------------------------
NFR-01 : It must be possible to extend the extension argument in the UDF_INIT
         and UDF_ARGS structures to add more capabilities in the future.

NFR-02 : Changes proposed in this worklog must not break existing UDFs.

Legend :
 - user - UDF author, who writes the UDF(s).
 - udf_init() - The init method for the udf defined by the user.
 - udf()  - Actual udf function defined by the user.
 - udf_deinit() - The deinit method defined by the user.

HS01: Possible solutions
-------------------------

The following approaches are possible:

1. Declare the character set of each argument and of the return value in
   udf_init(), and ensure that an explicit conversion from/to those
   character sets is made at call time. The udf() then deals with arguments
   in the specified character sets.

   Pros: Easy for UDF authors. They only need to specify the character
         sets of the arguments and the return value.

   Cons: A performance penalty for converting strings to/from the specified
         character sets.

2. Pass down the character set name of each argument to the UDF framework
   and let it decide which character set to convert the argument values from
   and to.

   Pros: Makes UDFs similar to built-in functions performance-wise, as it
         avoids the extra transformations and allows UDFs to deal with
         strings the way the rest of the server does.

   Cons: Increased complexity for UDF authors. They either need to convert
         themselves or use the likes of the string component service to deal
         with strings in varying encodings.
┌─────────────────────────────────────────────────────────────────────────────┐
│ This worklog aims at providing behavior similar to approach#1 using best of │
│ both approaches. It will provide the API(s) to the user to specify or read  │
│ the char_set_names of arguments and return value.                           │
│ Server will provide the converted arguments to the udf(). Udf author may    │
│ convert the return value in the specified character set in the udf().       │
└─────────────────────────────────────────────────────────────────────────────┘

HS02 : If we extend the UDF APIs to pass down this information, we could
       break ABI compatibility. Luckily, WL#2872 added an extension argument
       to UDF_ARGS and UDF_INIT. This argument is currently unused, so we can
       use it to pass the extra information reliably without breaking ABI
       compatibility. We shall add interface functions for handling it. UDF
       authors will be able to set or retrieve the values of these arguments
       using those methods.

HS03: API specifications :
--------------------------

HS03.1 A new server component service "mysql_udf_metadata" will be added. This
       service will provide the methods related to UDF metadata.

HS03.2 The above component service will provide the following UDF metadata
       methods to get or set the extension attributes of UDF arguments or of
       the return value.
       (1) argument_get : To retrieve the extension attributes of UDF
                          arguments.
       (2) argument_set : To set the extension attributes of UDF arguments.
       (3) result_get : To retrieve the extension attributes of the return
                        value.
       (4) result_set : To set the extension attributes of the return value.

HS03.3 API that retrieves the extension attribute of a UDF argument.

  argument_get(UDF_ARGS * udf_args /* Handle of UDF_ARGS structure*/,
               const char *extension_type /* Type of extension attribute to
                                             get */,
               unsigned int index /* Index of UDF argument of which extension
                                     attribute to be fetched */,
               void **out_value   /* Retrieved value */);

  One could retrieve the charset of the first UDF argument as follows.

  void *out_value = nullptr;
  const unsigned index = 0;
  my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
      "mysql_udf_metadata", mysql_plugin_registry_acquire());
  service->argument_get(udf_args, "charset", index, &out_value);
  const char *charset_name = static_cast<const char *>(out_value);

HS03.4 API that sets the extension attribute of a UDF argument.

  argument_set(UDF_ARGS * udf_args /* Handle of UDF_ARGS structure */,
               const char *extension_type /* Type of extension attribute to
                                             set */,
               unsigned int index /* Index of UDF argument of which extension
                                     attribute to be set */,
               void *in_value     /* Value to be set */);

  One could set the charset of the first UDF argument as follows.

  const char *name = "utf8mb4";
  char *value = const_cast<char *>(name);
  my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
      "mysql_udf_metadata", mysql_plugin_registry_acquire());
  service->argument_set(udf_args, "charset", 0, static_cast<void *>(value));

HS03.5 API that retrieves the extension attribute of a UDF return value.

  result_get(UDF_INIT * udf_init /* Handle of UDF_INIT structure */,
             const char *extension_type /* Type of extension attribute to be
                                           retrieved */,
             void **out_value /* Retrieved value */);

  One could retrieve the charset of the UDF return value as follows.

  void *out_value = nullptr;
  my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
      "mysql_udf_metadata", mysql_plugin_registry_acquire());
  service->result_get(udf_init, "charset", &out_value);
  const char *charset_name = static_cast<const char *>(out_value);

HS03.6 API that sets the extension attribute of a UDF return value.

  result_set(UDF_INIT * udf_init /* Handle of UDF_INIT structure */,
             const char *extension_type /* Type of extension attribute to be
                                           set */,
             void *in_value /* Value to be set */);

  One could set the charset of the UDF return value as follows.

  const char *name = "utf8mb4";
  char *value = const_cast<char *>(name);
  my_service<SERVICE_TYPE(mysql_udf_metadata)> service(
      "mysql_udf_metadata", mysql_plugin_registry_acquire());
  service->result_set(udf_init, "charset", static_cast<void *>(value));

HS03.7 The 'extension_type' argument in the APIs above is a char* instead of
       an enum for compatibility reasons: it is easier to deprecate and
       remove an extension type without breaking existing UDFs. The APIs
       above compare the value of 'extension_type' case-insensitively.
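
The case-insensitive matching described in HS03.7 could look roughly like the
sketch below. It is not the server's implementation: Extension_kind,
equals_ignore_case() and parse_extension_type() are hypothetical names
introduced here, and only the two attribute names mentioned in this worklog
("charset" and "collation") are recognised.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical internal classification of the extension_type string.
enum class Extension_kind { charset, collation, unknown };

// ASCII case-insensitive comparison of two NUL-terminated strings.
static bool equals_ignore_case(const char *a, const char *b) {
  for (; *a != '\0' && *b != '\0'; ++a, ++b) {
    if (std::tolower(static_cast<unsigned char>(*a)) !=
        std::tolower(static_cast<unsigned char>(*b)))
      return false;
  }
  return *a == *b;  // equal only if both strings ended together
}

// Maps the string passed by the UDF author to an internal kind. Unknown or
// missing names are reported instead of being silently accepted.
static Extension_kind parse_extension_type(const char *extension_type) {
  if (extension_type == nullptr) return Extension_kind::unknown;
  if (equals_ignore_case(extension_type, "charset"))
    return Extension_kind::charset;
  if (equals_ignore_case(extension_type, "collation"))
    return Extension_kind::collation;
  return Extension_kind::unknown;
}
```

Because the key is a string, a name that is later removed simply starts
mapping to "unknown" at run time, whereas removing an enum value would break
compilation of existing UDFs.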
       
HS03.8 The extension pointer in the UDF_ARGS structure will point to the
       following structure. We could add members to the structure in case we
       decide to add more extension arguments to the UDF argument. This
       structure is opaque to UDF users; it is accessible only through the
       APIs listed above.
       
      struct Udf_args_extension {
        Udf_args_extension() : charset_info() {}
        const CHARSET_INFO **charset_info;
      };
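
To make the relationship between the opaque extension pointer and the
accessor methods concrete, here is a self-contained mock. None of the types
are the server's: Mock_charset stands in for CHARSET_INFO, Mock_udf_args
keeps only the two UDF_ARGS fields the sketch needs, and mock_argument_set()
and mock_argument_get() are simplified, hypothetical counterparts of
argument_set and argument_get that handle only the "charset" attribute.
Returning true on failure is an assumed convention.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Stand-in for the server's CHARSET_INFO; only the name matters here.
struct Mock_charset {
  const char *name;
};

static const Mock_charset k_charsets[] = {{"binary"}, {"latin1"}, {"utf8mb4"}};

// Hypothetical lookup by exact name; returns nullptr if the name is unknown.
static const Mock_charset *find_charset(const char *name) {
  for (const Mock_charset &cs : k_charsets)
    if (std::strcmp(cs.name, name) == 0) return &cs;
  return nullptr;
}

// Mirrors the shape of Udf_args_extension: one charset slot per argument.
struct Mock_args_extension {
  std::vector<const Mock_charset *> charset_info;
};

// Only the fields of UDF_ARGS that this sketch needs.
struct Mock_udf_args {
  unsigned int arg_count;
  void *extension;  // opaque to the UDF author
};

// Sets the charset of argument `index`. Returns true on failure.
static bool mock_argument_set(Mock_udf_args *args, unsigned int index,
                              const char *charset_name) {
  if (args == nullptr || args->extension == nullptr) return true;
  auto *ext = static_cast<Mock_args_extension *>(args->extension);
  if (index >= args->arg_count || index >= ext->charset_info.size())
    return true;
  const Mock_charset *cs = find_charset(charset_name);
  if (cs == nullptr) return true;
  ext->charset_info[index] = cs;
  return false;
}

// Reads the charset name of argument `index`. Returns true on failure.
static bool mock_argument_get(const Mock_udf_args *args, unsigned int index,
                              const char **out_name) {
  if (args == nullptr || args->extension == nullptr) return true;
  const auto *ext = static_cast<const Mock_args_extension *>(args->extension);
  if (index >= args->arg_count || index >= ext->charset_info.size())
    return true;
  if (ext->charset_info[index] == nullptr) return true;
  *out_name = ext->charset_info[index]->name;
  return false;
}
```

The UDF author only ever sees the void pointer and the accessor functions, so
members can be added to the extension structure later without changing the
layout of the public structure.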
       
HS03.9 The extension pointer in the UDF_INIT structure will point to the
       following structure. We could add members to the structure in case we
       decide to add more extension arguments to the return value. This
       structure is opaque to UDF users; it is accessible only through the
       APIs listed above.

      struct Udf_return_value_extension {
        Udf_return_value_extension(const CHARSET_INFO *charset_info = nullptr)
            : charset_info(charset_info) {}
        const CHARSET_INFO *charset_info;
      };


HS04: Usage specifications
--------------------------

- The UDF arguments are evaluated right before the udf_init() method is
  called. That means the user can read the extension attributes of any UDF
  argument in udf_init() by calling the argument_get() API as explained in
  the previous section.

- If the user wishes to change the default value of an extension attribute of
  any UDF argument, the user can do so by calling the argument_set() API as
  explained in the previous section.

- If the character set or collation of any UDF argument is set to something
  other than the default value, then the server will provide converted
  arguments to the udf() at execution time. The argument(s) will be converted
  into the character set that the user set during udf_init().

- The user can retrieve the extension attributes of the return value by
  calling the result_get() API as explained in the previous section.

- The user can set the extension attributes of the return value by calling
  the result_set() API.

- If the user converts the return value from the default charset/collation to
  any other charset/collation, then the user must set the changed name in
  the extension attribute.
 
Background :
------------
  The following details are for information purposes only.

  - UDF execution is handled either by Item_udf_func or by its derived
    classes. These classes rely on a concrete class, udf_handler, that does
    the actual work. udf_handler::fix_fields() is responsible for executing
    the udf_init() method.

  - The udf_handler::val_<xyz>() method executes the udf() that returns the
    xyz. For instance, udf_handler::val_str() executes the udf() that returns
    a string.

  - udf_handler::fix_fields() is deemed to be executed at preparation time,
    while udf_handler::val_<xyz>() is deemed to be executed at run time.

  - udf_handler::get_arguments() is deemed to be executed at run time, right
    before the udf() is called.


LL01: Likely implementation:
----------------------------

1. At preparation time, Item_udf_func::fix_fields() will calculate the
   character set and collation of all supplied string arguments.

2. Item_udf_func::fix_fields() then sets the resolved character set of the
   arguments and the return value to "binary". With this default setting,
   no conversion will be performed at run time.

3. Item_udf_func::fix_fields() calls the udf_init() function, which may set
   the "extended" arguments specifying the character set/collation of the
   argument(s) or the return value.

4. When a UDF is called, get_arguments() ensures that each provided string
   is converted to the character set expected by the UDF.

5. After a UDF has been called, the return value is expected to be in the
   character set that was decided earlier. Depending on the expression it is
   used in, it may be converted further, but this is entirely up to the
   executor.
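
The run-time part of the steps above (the "binary" default of step 2 and the
conversion of step 4) can be simulated with a toy model. Everything in this
sketch is an assumption made for illustration: Toy_arg, Toy_udf and
toy_get_argument() are hypothetical names, the only conversion implemented is
from strict ISO-8859-1 (labelled "latin1" here) to UTF-8, and the real server
uses its own character set machinery instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Toy conversion: every ISO-8859-1 byte maps to the code point of the same
// value, which needs one UTF-8 byte below 0x80 and two bytes otherwise.
static std::string latin1_to_utf8(const std::string &in) {
  std::string out;
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// An argument value together with the charset resolved at prepare time.
struct Toy_arg {
  std::string bytes;
  std::string charset;
};

// What the UDF asked for in its init function; "binary" means "as is".
struct Toy_udf {
  std::string requested_charset = "binary";
};

// Step 4: convert only when the UDF requested a charset that differs from
// the one the argument already has.
static std::string toy_get_argument(const Toy_arg &arg, const Toy_udf &udf) {
  if (udf.requested_charset == "binary" ||
      udf.requested_charset == arg.charset)
    return arg.bytes;
  if (arg.charset == "latin1" && udf.requested_charset == "utf8mb4")
    return latin1_to_utf8(arg.bytes);
  throw std::invalid_argument("conversion not covered by this sketch");
}
```

With the default setting the bytes reach the UDF untouched, which is what
keeps existing UDFs working; only a UDF that opts in during init pays for the
conversion.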