WL#8989: Create an internal API and I_S interface for the MySQL parser to enumerate reserved and non-reserved words

Affects: Server-8.0   —   Status: Complete

It would be nice to introduce interfaces for enumerating reserved and non-reserved MySQL words:
  • in a form of generated .h file for the compile-time use in server sources and other C/C++ projects;

  • in a form of I_S view for the run-time use.

Note: see BUG#25366 for the original feature request.

  • F-1: I_S should contain a new view: INFORMATION_SCHEMA.keywords.

  • F-2: INFORMATION_SCHEMA.keywords should have two columns.

  • F-3: The 1st column "WORD" is of type VARCHAR(N).

  • F-4: The 2nd column "RESERVED" is of type INT.

  • F-5: SELECT WORD, RESERVED FROM INFORMATION_SCHEMA.keywords should return a complete list of reserved words and non-reserved words (keywords) from the MySQL SQL dialect grammar.

  • F-6: The 1st column should return all reserved and non-reserved words.

  • F-7: The 2nd column should return the value of 1, if the value in the 1st column is a reserved word, otherwise 0.

  • NF-8: At a build time, the compiler should generate a C/C++ include file: keyword_list.h.

  • NF-9: keyword_list.h should be a GPL-licensed file with correct copyright dates.

  • NF-10: keyword_list.h should be a self-contained C and C++ file.

  • NF-11: keyword_list.h should contain the "keyword_list" array of two columns with a similar structure and content like INFORMATION_SCHEMA.keywords has: the 1st column is of type "const char *", the 2nd is "int" etc. -- see F-3, F-4, F-6 and F-7.

Currently we have two complementary sources of the information we need:
  • The symbols array in the lex.h file. This array is insufficient for the current task, since both reserved and non-reserved words are intermixed there, and there is no direct way to differentiate them. Manual adding of "reserved"/"non-reserved" flags doesn't looks like a solution: this is very easy to do mistakes and be out of sync with the grammar.

  • The ident_keyword rule and its descendants in the grammar. Right hand sides (RHS) of those rules enumerate almost all possible names of non-reserved word. However, there is no information about textual token values. Also, we still have no reserved word info.

Implementation plan

Part I: compile-time generation

  1. Extract the list of non-reserved words from the grammar (ident_keyword etc.)

  2. To simplify the previous step, modify the grammar file:

    For some historical reason, almost all RHS of ident_keyword and friends suppress type checks with the with an empty semantic action:

    ...
      ACCOUNT_SYM           {}
    | ASCII_SYM             {}
    | ALWAYS_SYM            {}
    | BACKUP_SYM            {}
    | BEGIN_SYM             {}
    ...
    

    OTOH, by the fact, all of them have the <symbol> type, and at the outer level we access their semantic values by the <symbol> type tag.

    It would be nice to fix that type suppression, within the current WL or not. But in our case adding type tags to %token definitions of RHS terminals can simplify out task a lot:

    • this way we force developers to declare typed %tokens for every new non-reserved word: %token<symbol> SOME_NEW_TOKEN, otherwise Bison will warn on untyped new tokens;

    • instead of the recursive parsing of the ident_keyword rule tree, the task of a keyword extraction becomes a trivial regex scan job.

    Note: obviously, some developer can fool that regexp scan in some way, e.g. by adding a type tag with a separate %type declaration outside the %token line. That should be a rare case, so let add a check for that case later in a follow-up task.

  3. Now we have a set of token names for non-reserved keywords. Token names can coincide with textual word names, but sometimes they don't (because of the _SYM suffix etc.), so we need some way to get original word names:

    • symbols[] in lex.h is an associative array of textual names we need and token numbers.

    • Normally, if only we don't reset that order intentionally (and we don't), the 1st %token in the grammar file has the numeric value of 257. The rest of %tokens have values incremented by 1. Thus, we can align extracted token names from the grammar with token numbers in the symbols array (see lex.h). Then, this is trivial to get original names of non-reserved words.

    • The rest of names in the symbols array, excluding entries related to operators and optimizer hints, are reserved word names.

  4. Write a generated lex_keyword_list.h with a resulting array of two columns:

    • a word,

    • a binary "reserved"/"non-reserved" flag.

Part II: I_S interface

  1. It seems logical to use I_S for the new feature.

    Update DD to create such a view:

    CREATE VIEW INFORMATION_SCHEMA.keywords AS SELECT * FROM JSON_TABLE('[[ ["ACCESSIBLE", 1], ["ACCOUNT", 0], ["ACTION", 0,] ...]', '$[*]' COLUMNS( word VARCHAR(...) PATH '$[0]', reserved INT PATH '$[1]')) AS j

  2. The current I_S API requires a manual I_S version increment when I_S metadata migration or refresh is necessary. However, we can't expect such a manual update from SQL grammar developers, since the relation between grammar and I_S is not clear. Extra comments in the grammar won't help to not make mistakes too much though, so we need any suitable way to force developers for incrementing version numbers.

    The current WL proposes a test file trick to warn on incompatible changes in the keyword list: we can compare actual and saved INFORMATION_SCHEMA.keywords results, and fails if results differ. Then, if the I_S version of the saved result and the current I_S version are same, the test file notices developers to simply update saved results. Otherwise, if the I_S version has changed too, the test file will notify developers to update the I_S version in DD API source code first.

Note: no version tricks needed if we decide to create INFORMATION_SCHEMA.keywords on top of PFS instead of DD:

CREATE VIEW INFORMATION_SCHEMA.keywords AS SELECT * FROM performance_schema.keywords

... where the performance_schema.keywords table's handler maintains the actual keyword list, while the I_S metadata stays unchanged for a while.

However, PFS storage is an optional feature, so we can meet a PFS-less binary, where select from INFORMATION_SCHEMA.keywords will fail -- probably this is not an options.

New files:

  • sql/gen_keyword_list.cc: a build-time tool to transfor sql_yacc.yy into keyword_list.h.

  • sql/dd/impl/system_views/keywords.{h,cc}: a definition of dd::system_views::Keywords.

  • mysql-test/t/information_schema_keywords.{test,result}

New classes:

  • dd::system_views::Keywords: the INFORMATION_SCHEMA.keywords view creator class.

  • dd::system_views::System_view_query_expr_definition_impl: a helper class to create views on top of random length union chains (impressed by dd:system_views::System_view_union_definition_impl).