WL#12668: Pre-parse plugin for DDL statement clause filtering

Affects: Server-8.0   —   Status: Complete

In various situations when restoring output from a dump file, it is relevant to
filter out certain clauses from DDL statements. This is useful, e.g. when
restoring a dump file into an environment where encryption will not be applied.
In such a case, we would like to filter out the encryption clauses from the
CREATE TABLE statements.

The purpose of this WL is to design and implement a plugin which does simple
string replacement of DDL statements from dump files based on regular expression
matching. This will be an audit type plugin catching and handling pre-parse events.

It should be noted from the beginning that since the plugin is based on regular
expression matching, it is clear that there may be situations where the plugins
decides to replace occurrences that should not be replaced, e.g. if the search
pattern is matched within a comment.
Functional requirements
-----------------------
Note: The regular expressions below are defined using the ECMA script syntax as 
used by the C++ regex package 
(http://www.cplusplus.com/reference/regex/ECMAScript/).

FR0. DDL statements are expected to be generated by SHOW CREATE TABLE. This
means that complicating issues such as leading whitespace, arbitrary comments,
etc. can be ignored.

FR1. Only statements where the first character in the query string is 'c' or 'C'
shall be considered for rewrite.

FR2. Statements shall be considered for rewrite if and only if they match the 
following regular expression case insensitively:

  "^CREATE\s+TABLE"

This also covers FR1 of course, but for performance reasons, the check in FR1 is
done before FR2 to weed out statements that do not need rewriting as early as
possible.

FR3. Statements considered for rewrite shall have their DIRECTORY clauses
removed and replaced by a single space (to maintain proper separation of clauses), 
if matched by the following regular expression case insensitively: 

  "\s*,?\s*(DATA|INDEX)\s+DIRECTORY\s*?=?\s*?'[^']+?'\s*,?\s*"

FR4. Statements considered for rewrite shall have their ENCRYPTION clauses
removed and replaced by a single space (to maintain proper separation of clauses), 
if matched by the following regular expression case insensitively: 

  "\s*,?\s*ENCRYPTION\s*?=?\s*?'[NY]?'\s*,?\s*"

FR5. Multi line statements shall be handled correctly, i.e., clauses shall be
removed as expected regardless of which line they are located in.

FR6. Per partition directory clauses shall be handled correctly, i.e., the
DATA|INDEX DIRECTORY clause shall be removed both at the table level and for
individual partitions.


Non-functional requirements
---------------------------
NFR1. Query execution time shall not be significantly affected (typically <10%
increase in average execution time for load as specified in the "Non functional
test" in the QA notes below).

NFR2. The implementation of the plugin shall be unit-testable.
Overall plugin design
---------------------
The plugin is defined by an audit plugin descriptor and a generic plugin descriptor:

/* Audit plugin descriptor. */
static struct st_mysql_audit ddl_rewriter_descriptor = {
    MYSQL_AUDIT_INTERFACE_VERSION, /* interface version */
    NULL,                          /* release_thd()     */
    rewrite_ddl,                   /* event_notify()    */
    {
        0,
        0,
        (unsigned long)MYSQL_AUDIT_PARSE_ALL,
    } /* class mask        */
};

/* Plugin descriptor */
mysql_declare_plugin(audit_log){
    MYSQL_AUDIT_PLUGIN,           /* plugin type             */
    &ddl_rewriter_descriptor,     /* type specific descriptor*/
    "ddl_rewriter",               /* plugin name             */
    "Oracle",                     /* author                  */
    "Rewrite of DDL statements.", /* description             */
    PLUGIN_LICENSE_GPL,           /* license                 */
    plugin_init,                  /* plugin initializer      */
    NULL,                         /* plugin check uninstall  */
    NULL,                         /* plugin deinitializer    */
    0x0001,                       /* version                 */
    NULL,                         /* status variables        */
    NULL,                         /* system variables        */
    NULL,                         /* reserved                */
    0                             /* flags                   */
} mysql_declare_plugin_end;

The plugin initializer will register a performance schema key to track memory
usage, as shown below.


Performance schema instrumentation to track memory allocation
-------------------------------------------------------------
static PSI_memory_key key_memory_ddl_rewriter;

static PSI_memory_info all_rewrite_memory[] = {
    {&key_memory_ddl_rewriter, "ddl_rewriter", 0, 0, PSI_DOCUMENT_ME}};

static int plugin_init(MYSQL_PLUGIN) {
  const char *category = "rewriter";
  int count = static_cast(array_elements(all_rewrite_memory));
  mysql_memory_register(category, all_rewrite_memory, count);
  return 0; /* success */
}


Handling rewrite events
-----------------------
When notified about a new event, the plugin will first check whether this is an
event of the appropriate type with the expected query string, and make an early
exit if we don't need further analysis of the event, along the lines of the
following pseudo code (note that since this is just pseudo code, e.g. variable
types are not necessarily consistent and correct):

  if (event_class != MYSQL_AUDIT_PARSE_CLASS                         ||
      event_parse->event_subclass != MYSQL_AUDIT_PARSE_PREPARSE      ||
      event_parse->query.length == 0                                 ||
      (event_parse->query[0] != 'C' && event_parse->query[0] != 'c') ||
      !regex_search(event_parse->query, "^CREATE\s+TABLE", icase))
    return 0;

We now know this is pre-parsing of a CREATE TABLE statement. So we can go ahead
with replacements of DATA|INDEX DIRECTORY = '...' and ENCRYPTION = '...', but note:

  - Optional preceding and trailing commas.
  - Optional assignment operator.
  - Path enclosed in single quotes.
  - Use of ungreedy patterns to get correct grouping.

The regex definition and replacement is done according to the following pseudo code:

  std::regex directory_option(
      "\s*,?\s*(DATA|INDEX)\s+DIRECTORY\s*?=?\s*?[\"'][^\"']+?[\"']\s*,?\s*"
      std::regex::icase);
  rewritten_query = std::regex_replace(event_parse->query,
                                       directory_option, " ");

  std::regex encryption_option(
      "\s*,?\s*ENCRYPTION\s*?=?\s*?[\"'][NY]?[\"']\s*,?\s*",
      std::regex::icase);
  rewritten_query = std::regex_replace(rewritten_query, 
                                       encryption_option, " ");

Finally, if there was indeed a rewrite, we record the rewritten query in the
event that was broadcasted (note that memory allocation with performance schema
instrumentation, and string handling, are not included in the pseudo code below):

  event->rewritten_query->str = rewritten_query.c_str();
  event->rewritten_query->length = rewritten_query.length();
  *(reinterpret_cast(event_parse->flags)) |=
      MYSQL_AUDIT_PARSE_REWRITE_PLUGIN_QUERY_REWRITTEN;