WL#8016: Parser for optimizer hints

Status: Complete

This work log will implement parser rules etc for new hint syntax as described in WL#3996.

Gleb has implemented 4 alternative prototypes for a to parse hints.

The basic approach for the four prototypes are as follows:

A set of parser rules that creates a list of hints which is forwarded to the hint processor. Each hint consists of a hint string (name) and a list of parameters. The parser does not need to be changed when new hints are added.
- Implementation details: the main parser grammar (sql_yacc.yy) has only one generic rule for all hints: it doesn't care about individual hint rule syntaxes, this is a business of underlying hint processor (C++) to recognize an validate them all.
- good: minimal lexical scanner (sql_lex.cc) changes: it have to recognize additional tokens from "/*+" and following "*/" strings and pass them to the main parser.
- good: minimal changes in the main grammar:
  - we add only one generic rule for all kinds of future hints at the very beginning and never change it again,
  - we insert references to that generic hint rule after each SELECT_SYM, INSERT, REPLACE, UPDATE_SYM and DELETE_SYM token entry in the grammar.
- bad: results of hint parsing are very generic: C++ code have to recognize and validate every hint from some random parse tree -- to much work, easy to bloat C++ code.
Parser rules for individual hints. Each hint type is handled by a separate contextualize function. New rules will have to be added for each hint.
- Implementation details: each individual hint extend the main grammar (sql_yacc.yy).
- good: minimal lexical scanner (sql_lex.cc) changes -- like in the 1st alternative.
- good: direct and easy way for add detailed rules for new optimizer hints into the current parser grammar (sql_yacc.yy).
- bad: detailed hint grammar bloats the main grammar (sql_yacc.yy).
A single parser rule which forwards the entire hint as a single string to the hint processor. No parser change is needed when hints are added, but hint processor will have to implement the lexer/parser for the hint string.
- good: minimal changes of the lexical scanner (sql_lex.cc) as before.
- good: minimal changes in the main grammar (sql_yacc.cc) -- recognize a special optional HINT_SYM token after every SELECT_SYM, INSERT, REPLACE, UPDATE_SYM and DELETE_SYM entry in the grammar. Also see the 4th alternative: we can do the same and live without that HINT_SYM token (i.e. no grammar changes at all).
- bad: this is a business of the optimizer to parse hint string byte by byte.
- bad or acceptable: we "parse" the hint string twice:
  - the 1st pass with the main lexical scanner to separate the string inside /*+ ... */ commentary,
  - the 2nd pass with some future hint string parser inside the optimizer.
The separate Bison grammar parses a hint commentary text with the help of the separate lexical scanner. The lexical scanner reuses (after some refactoring) our home-made hash function generator gen_lex_hash.cc to recognize hint names.
- good: no main grammar (sql_yacc.yy) changes at all: we pass hint parse tree from the lexical scanner to the main grammar as a token value of SELECT_SYM, INSERT, REPLACE, UPDATE_SYM and DELETE_SYM tokens.
- good: hint parser grammar is a separate Bison file (sql_hints.yy):
  - easy to extend,
  - detailed strict grammar for each of hints (like in the 2nd alternative),
  - no additional overhead to the main grammar parser.
- good: the current implementation doesn't parse hint commentary internal twice, the main lexical scanner (sql_lex.cc) just passes the control to the specialized hint lexical scanner (sql_lex_hints.{h,cc}).
- bad: the current implementation requires some overhead to look-up for the "/*!" but if and only if that string follows:
  - SELECT, INSERT, REPLACE, UPDATE or DELETE keyword at the beginning of the query or
  - SELECT keyword at the beginning of the query block (i.e. the SELECT_SYM token after the "(" symbol or after the UNION_SYM token).

Decision

The 4th way seems to be the best one:

It offers a way for a separate development of a hint grammar in the simple declarative way (separate Bison grammar for hints).
Additional parsing overhead looks low or negligible.
New separate lexical scanner requires a way to recognize new separate set of keywords (hint names). The current lexical scanner uses the generated hash function (by the gen_lex_hash.cc tool) that is hardcoded for use with main sql_lex.cc and sql_yacc.yy scanner and grammar. The 4th alternative gives a chance to refactor old and unstructured gen_lex_hash.cc etc infrastructure to produce a few separate hash functions for separate scanners/parsers.

Where it recognizes hint commentaries
Changes in the code
How to add a new hint
Data flow
- Old parser
- New parser

Where it recognizes hint commentaries

At the beginning of DML statements:
- SELECT /*+ ... */
- INSERT /*+ ... */
- REPLACE /*+ ... */
- UPDATE /*+ ... */
- DELETE /*+ ... */
At the beginning of query blocks:
- (SELECT /*+ ... */ ... )
- (SELECT ... ) UNION (SELECT /*+ ... */ ... )
- (SELECT /*+ ... */ ... ) UNION (SELECT /*+ ... */ ... )
- UPDATE ... WHERE x IN (SELECT /*+ ... */ ...)
- INSERT ... SELECT /*+ ... */
Same as #1 a #2 for explainable statements:
- EXPLAIN SELECT /*+ ... */ etc. (see #1)
- EXPLAIN UPDATE ... WHERE x IN (SELECT /*+ ... */ ...) etc. (see #2)

Changes in the code

New files

sql_hints.yy: new Bison parser for optimizer hints.
parse_tree_hints.{h,cc}: new parse tree node classes for hint parse tree.
sql_lex_hints.{h,cc}: new lexical scanner for optimizer hints.
sql_chars.h: old main lexical scanner "states" (character classes actually) and new character classification for hit lexical scanner.
sql_lex_hash.{h,cc}: common static code of the refactored lex scanner hash function.

New parse tree node classes

The hint parser uses descendants of the same Parse_tree_node class for its parse tree nodes like the main parser does. To apply the semantics of the parsed hint commentary to current LEX, SELECT_LEX etc objects, please use overloaded virtual Parse_tree_node::contextualize() function.

The PT_hint_list class object represents the whole hint commentary parse tree (PT_hint_list is ready to use). The PT_hint_list::hints contains a list of individual hints of a particular hint commentary string. Each hint parse tree node is a child of the common PT_hint class.

The current patch completely implements the MAX_EXECUTION_TIME hint support (please see the PT_hint_max_execution_time for the reference). Other hints from the base WL#3996 have dummy implementations (we need to overload each contextualize() function to complete them).

Hints and their parse tree node classes:

MAX_EXECUTION_TIME	PT_hint_max_execution_time
DEBUG_HINT1 (temporary)	PT_hint_debug1
DEBUG_HINT2 (temporary)	PT_hint_debug2

Hint parse tree usage in the main parser

SELECT statement and query blocks

We pass the resulting hint parse tree as a first or second parameter to parse tree nodes of the main parser to combine hint trees and the main tree into a single big parse tree. Then we call contextualize() on the tree root to apply hint semantics to the LEX/SELECT_LEX object.

select_init rule

generates a top-level SELECT or the Nth SELECT in the UNION.

Parse tree node classes:

PT_select_init2
PT_select_paren

query_specification rule

generates SELECT in the UNION (query specification)

Parse tree node class:

PT_select_paren_derived

select_paren_derived rule

generates SELECT in the UNION (query specification)

Parse tree node class:

PT_query_specification_select

table_factor rule

actually generates derived table SELECT

Parse tree node class:

PT_table_factor_select_sym

INSERT statement

See toplevel "insert_stmt" rule.

Parse tree node class: PT_insert.

REPLACE statement

See toplevel "replace_stmt" rule.

Parse tree node class: PT_insert (we share the same class with INSERT).

UPDATE statement

See toplevel "update_stmt" rule.

Parse tree node class: PT_update.

DELETE statement

See toplevel "delete_stmt" rule.

Parse tree node class: PT_delete.

New warning messages

ER_WARN_OPTIMIZER_HINT_SYNTAX_ERROR
"Optimizer hint syntax error"
ER_WARN_BAD_MAX_EXECUTION_TIME
"Unsupported MAX_EXECUTION_TIME"
ER_WARN_UNSUPPORTED_MAX_EXECUTION_TIME
"MAX_EXECUTION_TIME hint is supported by SELECT statements only"

Note: in the case of a *syntax* error (see ER_WARN_OPTIMIZER_HINT_SYNTAX_ERROR) the parser outputs a warning and skips the rest of hints in the hint commentary. For other hint errors (like ER_WARN_BAD_MAX_EXECUTION_TIME) we just skip the malformed hint and process the rest.

Refactoring (partial) of the main lexical scanner

For some historical reasons the old parser uses a hand-written lexical scanner. List of that reasons (probably incomplete):

"Lexer hacks": in some cases we change lexical scanner's state from the parser (this is a subject for the further refactoring).
Character encodings that not easy to parse with a generated lexical scanner (especially 1-2-4 byte long characters of GB18030 that we support in the lexer and 2 byte long UCS-2 characters that we still not support in the lexical scanner).

That lexical scanner uses the my_lex_states enum to classify characters of supported character sets. For some other historical reasons that enum is declared in the public m_ctype.h file. The idea to place that lexer's internal data there seems to be not very good, especially if we create new similar lexers with their own character classifications. So, we move my_lex_states enum declaration to the new sql_chars.h file.

The current CHARSET_INFO structure refers to a 256 byte long array with a table of "states" for the main lexical scanner (my_lex_states enum values).

It seems like that the reasons to save lexer's internal data in global CHARSET_INFO structures are simplicity and performance. Usually we access CHARSET_INFO from the parser/lexer via the search by the charset name (that is expensive) or by the CHARSET_INFO pointer (but not by the charset number in the charset array or so on). If we save that "state" array pointer outside the CHARSET_INFO structure, then we need an additional search by a charset name to access it.

Thus it was decided to stay with a pointer to lexer's internal data in the CHARSET_INFO structure, but the initialization of that data has been moved from the mysys library to the lex_init() function. Also, the CHARSET_INFO::states field has been extended to save additional lexer's information (hint lexer's character classifications). Note: Binary structure of the CHARSET_INFO structure itself is unchanged.

Refactoring of the hash function generator

The existing lexical scanner uses a generated hash function to recognize keywords and internal function names. For historical reasons we use a hand-written hash function generator: gen_lex_hash.cc. That generator is tied to the main lexical scanner and parser code, i.e. it can't generate more than 2 hash functions by design (one for reserved words and one for internal function names + reserved words).

Since we introduce a new parser with a separate set of reserved words, we have to recognize these reserved words in some way. The current hash function generator produces fast hash functions that seem to outperform some industry-standard generators, so it would be nice to reuse it for new parsers. Thus the WL introduces a way to refactor and extend gen_lex_hash.cc functionality for that.

First, we move static parts of generated code from gen_lex_hash.cc string constants and from sql_lex.cc to new sql_lex_hash.{h,cc} files.
Then, we merge "symbol" arrays in the lex.h file into one for the simplicity/performance: in any hash function we need just a plain index in the "symbols" array to locate the SYMBOL object. Actually, we can have a growing number of SYMBOL arrays there, but please take into account that our sets of SYMBOL objects intersect, so it's logical to save them all into single array and mark different sets with the SYMBOL::group bitmap.
Finally, we refactor and extend the rest of gen_lex_hash.cc to be more structured, readable and flexible to produce any number of hash functions on demand.

How to add a new hint

Note: please see existent implementation of MAX_EXECUTION_TIME, DEBUG_HINT1 and DEBUG_HINT2 hints for reference.

For example, we have to create a new hint:

 BKA([@query-block-name] table, [table ...])

where

 table ::= table-name[@query-block-name]

sql_hints.yy

Define new token for hint keyword

To define the new BKA keyword, please declare the BKA_HINT token:

  /* Hint keyword tokens */
  
  %token MAX_EXECUTION_TIME_HINT
  ...
+ %token BKA_HINT

Create grammar rule for new hint

Important note: please use the YYABORT macro for fatal errors (OOM) only!

The YYABORT macro in the hint parser aborts the whole SQL query!

Add a rule for the new hint:

  bka:
            BKA_HINT '(' opt_qb_name hint_param_table_list ')'
            {
              $$= NEW_PTN PT_hint_bka($3, $4);
              if ($$ == NULL)
                YYABORT; // OOM
            }
          ;

There we reserve the PT_hint_bka name for the parse tree node that we'll declare later.

Note: opt_qb_name and hint_param_table_list rules exists already for query-table-name and table, [table ...] list, respectively.

Also, there is hint_param_index_list and opt_hint_param_index_list for mandatory and optional index lists, respectively.

To create a numeric parameter to the hint, please use the HINT_ARG_NUMBER token: it returns LEX_CSTRING with a source text of a positive decimal integer. If the result value in the input query is not suitable for your hint (out of range error) please don't abort the parser with YYABORT: please output a warning and return NULL from the hint rule instead. For example, see the max_execution_time_hint rule:

 max_execution_time_hint:
         MAX_EXECUTION_TIME_HINT '(' HINT_ARG_NUMBER ')'
         {
           int error;
           char *end= const_cast<char *>($3.str + $3.length);
           longlong n= my_strtoll10($3.str, &end, &error);
           if (error != 0 || end != $3.str + $3.length || n > UINT_MAX32)
           {
      // NOTE: don't call YYABORT there!
      // Please output a warning and set $$ to NULL instead:
             scanner->syntax_warning(ER(ER_WARN_BAD_MAX_EXECUTION_TIME));
             $$= NULL;
           }
           else
           {
             $$= NEW_PTN PT_hint_max_execution_time(n);
             if ($$ == NULL)
               YYABORT; // This is fatal error (OOM), please call YYABORT
	         // to shutdown the whole thread
           }
         }
      ;

Declare type for new hint

All hint rules have the same hint "type":

   %type <hint>
     max_execution_time_hint
     ...
+    bka

Add new grammar rule to RHS of hint rule

All hints rules are enumerated as RHS (right hand sides of the) "hint" grammar rule:

   hint:
             max_execution_time_hint
           ...
+          | bka
           ;

lex.h

Add "BKA" string with the BKA_HINT key to hint lexical scanner's hash function:

   static const SYMBOL symbols[] = {
   ...
     /*
       Insert new optimizer hint keywords after than commentary:
     */
     { SYM_H("MAX_EXECUTION_TIME",             MAX_EXECUTION_TIME_HINT)},
     ...
+    { SYM_H("BKA",                            BKA_HINT)},

parse_tree_hints.{h,cc}

Declare a class for the parse tree node:

   class PT_hint_bka : public PT_hint
   {
     typedef PT_hint super; // private alias, not visible for successor classes
     typedef Mem_root_array<Hint_param_table, true> Table_list;

     /*
       Token values and parse tree nodes from a RHS of the bka grammar rule:
     */
     const LEX_CSTRING opt_qb_name;
     Table_list *table_list;

   public:
     PT_hint_bka(const LEX_CSTRING &opt_qb_name_arg,
                 Table_list *table_list_arg)
     : opt_qb_name(opt_qb_name_arg),
       table_list(table_list_arg)
     {
       // Parse-time constructor: don't access LEX, SELECT_LEX etc here!
     }

     virtual bool contextualize(Parse_context *pc)
     {
       if (super::contextualize(pc))
         return true; // failure

       // TBD: access/modify SELECT_LEX there via pc->select()

       return false; // success
     }
   };

Predefined tokens

The hint lexical scanner provides a few predefined tokens to build hint grammar rules.

Hint argument token names starts with the HINT_ARG_ prefix.

HINT_ARG_NUMBER

 HINT_ARG_NUMBER ::= [0-9]+

This token returns positive integers in a textual form (LEX_CSTRING). We have to check ranges and convert token's value at the parser's side. See the max_execution_time_hint grammar rule for references.

HINT_ARG_IDENT

This token's scanner is designed to be compatible with main parser's rules for identifier scanning.

Identifier formats:

Simple identifier:

starts with "_", "$" or any alphabetical character,

continues with 0 or more "_" or "$" signs, alphabetical characters or decimal digits.
Identifier with a numeric prefix:

starts with a decimal digit,

continues with 1 or more "_" or "$" signs, alphabetical characters or decimal digits.
Back-quoted identifier:

starts with the back-quote (`) separator sign,

continues with 1 or more valid characters excluding "\0",

duplicated separators (``) convert into a single ` symbol (not a separator),

ends with the back-quote (`) separator sign.
Double-quoted identifier:

starts with the double-quote (") separator sign,

continues with 1 or more valid characters excluding "\0",

duplicated separators ("") convert into a single " symbol (not a separator),

ends with the double-quote (") separator sign.

Note 1: The "alphabetic" character set includes:

national alphabetic characters of the current character set,
valid multibyte characters (if any) of the current character set.

Note 2: The "valid" character set includes:

for single-byte character sets: any character excluding "\0";
for multi-byte character sets: any valid single- or multi-byte character excluding "\0".

Note 3: Double-quote sign is an identifier separator if and only if the ANSI_QUOTES flag is on (see the SQL_MODE system variable).

Note 4: The lexical scanner converts and decodes identifier names itself:

at the parser side identifiers always have a system charset,
all double-quote and back-quote separators are stripped down and
all inner duplicate separator signs are decoded into single symbols.

Note 5: From the SQL parser's "point of view" our hint expressions are regular commentaries. Thus, the "*/" sequence must close the hint expression list, even the "*/" sequence occurs inside a quoted identifier!

So, the

 SELECT /* HINT(`tricky hint name*/`) */ 1

query will fail with a syntax error at "`) */ 1".

OTOH the

 SELECT /* HINT(`tricky hint name*/`) */`.* FROM t1 `) */`

is a valid query that returns columns of the t1 (the `) */` sequence is a tricky alias of t1). In that case the hint parser will warn about invalid HINT and return the HINT_ERROR token to the hint parser.

HINT_ARG_QB_NAME

This token is a HINT_ARG_IDENT with the "@" prefix sign (whitespace characters between the "@" prefix and the identifier suffix aren't allowed).

The lexical scanner recognizes the HINT_ARG_QB_NAME token if and only if:

The "@" sign occurs right after the previous HINT_ARG_IDENT token (whitespace

separators between the previous identifier and the "@" sign are not allowed):

ident@query_block_name

`tricky ident`@query_block_name,

ident@`select#1`,

`tricky ident`@`select#1`,

123tricky$ident@012tricky_query_block_name etc.
The "@" sign occurs after the "(" signs (whitespace separators are ok):

/*+ HINT(@query_block_name ... ) */,

/*+ HINT( @query_block_name ... ) */

/*+ HINT( @"select#1" ... ) */ -- if ANSI_QUOTES flag is set,

/*+ HINT( @```tricky name``` ... ) */ etc.

The HINT_ARG_QB_NAME token's scan process depends on the context:

HINT_CLOSE and HINT_ERROR

These tokens are for use in the root start rule only.

Please don't reference to them in hint grammar rules.

The HINT_CLOSE token represents the "*/" sequence in the parser grammar: it's needed since we don't scan the hint "commentary" string twice (the main lexer passes its control to the hint scanner/parser on the "/*+" sequence, and the hint parser/lexer returns the control back on the "*/" sequence or on the end of query).

The HINT_ERROR token represents scanner errors:

OOM,
invalid characters,
unexpected end of query string,
empty quoted identifier ("" if ANSI_QUOTES is on or ``).

How to add a new semantic value type

Since we have to save the compatibility with the old Bison 2.1 release, it is problematic to have a separate semantic value type (%union) per grammar: we share the same union YYSTYPE (see sql_lex.h) for both main sql_yacc.yy and new sql_hints.yy grammars instead.

The YYSTYPE union starts with a hint-specific section:

union YYSTYPE {
  /*
    Hint parser section (sql_hints.yy)
  */
  opt_hints_enum hint_type;
  LEX_CSTRING hint_string;
  class PT_hint *hint;
  class PT_hint_list *hint_list;
  Hint_param_index_list *hint_param_index_list;
  Hint_param_table hint_param_table;
  Hint_param_table_list *hint_param_table_list;

  /*
    Main parser section (sql_yacc.yy)
  */
...

Please add new "types" there (before the main parser section). Please take into account that the type name must be unique for any parser, main or hint-related.

For example, if you are going to add to grammar some token or rule of type Sample_type with semantic value type name sample_type, please do:

sql_lex.h: add a new field to the "Hint parser section" of the YYSTYPE union:

union YYSTYPE {
  /*
    Hint parser section (sql_hints.yy)
  */
  opt_hints_enum hint_type;
  LEX_CSTRING hint_string;
  class PT_hint *hint;
  class PT_hint_list *hint_list;
  Hint_param_index_list *hint_param_index_list;
  Hint_param_table hint_param_table;
  Hint_param_table_list *hint_param_table_list;
+ Sample_type sample_type;

  /*
    Main parser section (sql_yacc.yy)
  */
...

sql_hints.yy: add new terminals or non-terminals with the sample_type type:

+ %type <sample_type> SAMPLE_TOKEN sample_rule

Data flow

Old parser

In the original server, we activate the parser by the MYSQLparse() function call. The MYSQLparse() function pulls lexical data from the lexical scanner token by token with the help of the MYSQLlex() function.

New parser

In the new server we activate the main parser by the MYSQLparse() function as before. The main parser pulls tokens from the main lexical scanner by the repetitive MYSQLlex() call as well.

Additionally, the main lexical scanner looks up for the "/*+" string now and may pass the control to the secondary parser (HINT_PARSER_parse() function) in some conditions. With the help of the secondary lexical scanner (HINT_PARSER_lex() function) it pulls a hint string token by token and returns the control to main lexical scanner and main parser at the end of the hint commentary string.

When the secondary (hint) parser returns its control to the main lexical scanner, the lexical scanner associates the resulting hint parse tree with the value of SELECT_SYM, INSERT, REPLACE, UPDATE_SYM or DELETE_SYM and return it that way to the main parser.

The main parser gets optional hint parse tree pointer from SELECT_SYM etc. token values or silently ignores that value (if the hint commentary occurs at the wrong place).

WL#8016: Parser for optimizer hints

Functional requirements:

Non-Functional requirements:

Decision

Contents

Where it recognizes hint commentaries

Changes in the code

New files

New parse tree node classes

Hint parse tree usage in the main parser

SELECT statement and query blocks

select_init rule

query_specification rule

select_paren_derived rule

table_factor rule

INSERT statement

REPLACE statement

UPDATE statement

DELETE statement

New warning messages

Refactoring (partial) of the main lexical scanner

Refactoring of the hash function generator

How to add a new hint

sql_hints.yy

Define new token for hint keyword

Create grammar rule for new hint

Declare type for new hint

Add new grammar rule to RHS of hint rule

lex.h

parse_tree_hints.{h,cc}

Predefined tokens

HINT_ARG_NUMBER

HINT_ARG_IDENT

HINT_ARG_QB_NAME

HINT_CLOSE and HINT_ERROR

How to add a new semantic value type

Data flow

Old parser

New parser