WL#7199: True bottom-up server parser: common framework for the refactoring
Affects: Prototype Only
Status: Complete
This is a design WL for the common bottom-up parser refactoring framework. This worklog implements one step of the overall task to "Refactor the MySQL server parser to build the AST in a natural 'bottom-up' way". One of the design goals is to have larger syntax rules in the future, leading to fewer (but larger) intermediate nodes in the parse tree. The argument is that this will give fewer node allocations and hopefully also a faster grammar, as there are fewer rules altogether.
No functional requirements.

Non-functional requirements:
* no query result changes;
* no performance degradation.
Goal of this WL: provide base framework classes and a methodology to
transform the current SQL parser to a pure bottom-up parser by small,
separate and safe steps:
* grammar simplifications:
  ** de-duplication of similar rules,
  ** removal of useless intermediate rules that a) have a single RHS and
     b) whose RHS consists of a single non-terminal symbol;
* reducing the number of shift/reduce grammar conflicts.

How the new grammar looks after the "bottom-up" refactoring:

* No semantic actions at the beginning or in the middle of rule right hand
  sides (RHS). I.e. only one semantic action is allowed per RHS, and it is
  permitted at the end of the RHS only.
* The only things allowed in the semantic action are:
  ** memory allocation of a parse tree node,
  ** (very rare) feedback to the scanner,
  ** limited read-only access to things like
     thd->variables.character_set_client.

We defer and move all other semantic things out of grammar rules, i.e.:
* most of the error messaging,
* everything that depends on the context:
  ** context-dependent creation of Items,
  ** allocation and filling of LEX/SELECT_LEX/SELECT_LEX_UNIT structures,
* saving/restoring of intermediate context-related information in LEX and
  Parser_state structures.

Thus, after the refactoring the typical grammar rule looks like this:

  lhs:
    nonterminal1 TERMINAL2 TERMINAL3 nonterminal4 ... nonterminalN
    {
      $$= new(mem_root) X(@$, $1, $4, @4, ... $N);
    }
  | ...
  ...
  ;

where X is the type of the particular parse tree node and "@$" is the
Bison "location" of the RHS's beginning. Or, the same with the helper
NEW_PTN macro:

  lhs:
    nonterminal1 TERMINAL2 TERMINAL3 nonterminal4 ... nonterminalN
    {
      $$= NEW_PTN X(@$, $1, $4, @4, ... $N);
    }
  | ...
  ...
  ;

And like this:

  lhs:
    TERMINAL1 { $$= TRIVIAL_CONSTANT1; }
  | TERMINAL2 { $$= TRIVIAL_CONSTANT2; }
  ...
  ;

for trivial cases where allocating parse tree nodes is not reasonable.
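In C++ terms, the target end-of-rule action reduces to a plain constructor call that only records the location and the already-reduced children. A minimal sketch, with hypothetical names (POS here is a stand-in for Bison's location type, and PT_example is not an actual server class):

```cpp
#include <cassert>

// Stand-in for Bison's location type (@$, @4, ...); hypothetical.
struct POS { int begin_line, end_line; };

// Hypothetical parse tree node for one rule: the constructor does nothing
// context-sensitive -- it only stores the location and the child values.
struct PT_example {
  POS pos;
  int child1;  // value of $1
  int child4;  // value of $4
  PT_example(const POS &p, int c1, int c4)
    : pos(p), child1(c1), child4(c4) {}
};
```

A semantic action of the form "$$= NEW_PTN PT_example(@$, $1, $4);" would amount to exactly this constructor call plus a MEM_ROOT allocation.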
The topmost grammar rule returns the root object of the parse tree back to
the caller via an additional parameter of the MYSQLparse() function:

  %parse-param { class THD *thd }
  %parse-param { Parse_tree_node **tree_root }
  %start topmost_rule
  ...
  topmost_rule:
    query { *tree_root= $1; }
  ;

and the current definition of the MYSQLparse function:

  int MYSQLparse(void *thd);

has changed to:

  int MYSQLparse(class THD *thd, Parse_tree_node **tree_root);

with the help of the "%parse-param" Bison feature (the bugfix for
bug#18017820 introduces the use of %parse-param). The refactored calling
code in the parse_sql() function is:

  ...
  Parse_tree_node *parse_tree;
  bool mysql_parse_status= MYSQLparse(thd, &parse_tree) != 0;
  mysql_parse_status|= parse_tree->contextualize(thd);
  ...

where the "parse_tree->contextualize(thd)" call does the contextualization
of the parse tree, i.e. its transformation into our current abstract
syntax tree (AST). In other words, this call wraps and subsequently
executes all the context-dependent and context-modifying code that we
moved out of the old semantic actions in previous steps of the
refactoring.

Transformation
==============

Let's illustrate the idea and the transformation process with examples.

The simplest case
-----------------

  simplest_nonterm:
    TERMINAL { some code that reads and modifies LEX }
  ;

Generally speaking, pure bottom-up parsing requires a purely
context-independent grammar (including semantic actions). The semantic
action above currently depends on the context that "calling" rules set up,
so let's try to transform the grammar to remove that dependency. This
sample rule doesn't reference other nonterminals, so the way to make it
context-independent is obvious: we can lift its semantic action up and
apply it right after every reference to "simplest_nonterm".

Before the transformation:

  simplest_nonterm:
    TERMINAL { simplest_nonterm_code; }
  ;

  caller1:
    smth ... simplest_nonterm { code-1; }
  | smth ... simplest_nonterm TERMINAL1 { code-2; }
  | smth ... { code-3; } simplest_nonterm
  ;

After the transformation:

  simplest_nonterm:
    TERMINAL {}
  ;

  caller1:
    smth ... simplest_nonterm { simplest_nonterm_code; code-1; }
  | smth ... simplest_nonterm TERMINAL1 { simplest_nonterm_code; code-2; }
  | smth ... { code-3; } simplest_nonterm { simplest_nonterm_code; }
  ;

Obviously, the grammars above are semantically equivalent, just as inlined
functions are equivalent to their non-inlined calls.

More complex case
-----------------

However, if the RHS of the "calling" rule references other nonterminals
after the simplest_nonterm:

  caller2:
    smth ... simplest_nonterm nonterm2 { code-20; }
  ;

we have to keep nonterm2's semantic actions in mind: we can't just place
simplest_nonterm_code before code-20, since simplest_nonterm_code may
change the global context that nonterm2 may still depend on. So, the safer
transformation is:

  caller2:
    smth ... simplest_nonterm { simplest_nonterm_code; } nonterm2
    { code-20; }
  ;

Unfortunately, that transformation may be unsafe from the parser
generator's point of view: we are inserting a new semantic action in the
middle of an existing rule, so we can introduce a new grammar ambiguity
(conflicts) here, since the new semantic action may disable some necessary
lookahead. The best way to work around such trouble is to transform
"nonterm2" first.

Before:

  nonterm2:
    something_transformed { nonterm2_code; }
  ;

  simplest_nonterm:
    TERMINAL { simplest_nonterm_code; }
  ;

  caller2:
    smth ... simplest_nonterm nonterm2 { code-20; }
  ;

First transformation:

  nonterm2:
    something_transformed {}
  ;

  simplest_nonterm:
    TERMINAL { simplest_nonterm_code; }
  ;

  caller2:
    smth ... simplest_nonterm nonterm2 { nonterm2_code; code-20; }
  ;

Second transformation:

  nonterm2:
    something_transformed {}
  ;

  simplest_nonterm:
    TERMINAL {}
  ;

  caller2:
    smth ... simplest_nonterm nonterm2
    { simplest_nonterm_code; nonterm2_code; code-20; }
  ;

So, the rule is: transform the RHS from the rightmost nonterminal to the
leftmost one where applicable.
However, this rule doesn't help much in cases like this:

  caller2:
    simplest_nonterm nonterm2 { ... }
  | nonterm2 simplest_nonterm { ... }
  ;

We can still find some workarounds there: transform simplest_nonterm and
nonterm2 simultaneously, or make sure that one of them doesn't modify the
context of the other and transform them in a safe order, or something else
like that. OTOH we can use a more general approach there, as for recursive
grammars (see below).

Recursive case
--------------

However, the grammar is not a tree but a cyclic graph, and we also have
many cases of direct and indirect recursion, where the described method of
semantic action code lifting doesn't work at all:

  // Direct recursion:
  nonterm3:
    nonterm3 '+' nonterm3 { code_31; }
  | TERMINAL3 { code_32; }
  ;

  // Indirect recursion between nonterm4 and nonterm5:
  nonterm4:
    TERMINAL4 nonterm5 { code_40; }
  ;
  nonterm5:
    nonterm4 { code_51; }
  | TERMINAL5 { code_52; }
  ;

  // Upper level:
  caller3:
    nonterm3
  | nonterm4
  ;

The obvious solution to that problem has two steps:

1. build a parse tree: defer the execution of the current semantic action
   code and replace it with the allocation of parse tree node objects;
2. when the parsing is done and we have a complete parse tree, "interpret"
   the tree as a program: do a depth-first traversal and execute the
   former semantic action code parts in the correct order.

It seems logical to use the Bison value stack to return newly created
parse tree nodes from refactored rules.
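The two steps above can be sketched in plain C++. This is a minimal, hypothetical illustration (Node, Node_31, Node_32 mirror the names used below, but the real server classes carry THD, MEM_ROOT and error handling): each node's constructor just saves its subtrees, and a virtual contextualize() later replays the former action code in depth-first order.

```cpp
#include <string>

// Hypothetical root of the parse tree node hierarchy: construction is
// context-free; contextualize() replays the deferred semantic actions.
struct Node {
  virtual ~Node() = default;
  virtual void contextualize(std::string *log) = 0;
};

// Corresponds to "nonterm3: nonterm3 '+' nonterm3 { code_31; }"
struct Node_31 : Node {
  Node *left, *right;
  Node_31(Node *l, Node *r) : left(l), right(r) {}
  void contextualize(std::string *log) override {
    left->contextualize(log);   // depth-first: children first,
    right->contextualize(log);  // in grammar order
    *log += "code_31;";         // then the rule's former action code
  }
};

// Corresponds to "nonterm3: TERMINAL3 { code_32; }"
struct Node_32 : Node {
  void contextualize(std::string *log) override { *log += "code_32;"; }
};

std::string parse_and_contextualize() {
  // Tree for "TERMINAL3 + TERMINAL3", as Bison would build it bottom-up.
  Node_32 a, b;
  Node_31 root(&a, &b);
  std::string log;
  root.contextualize(&log);  // deferred "interpretation" of the parse tree
  return log;
}
```

The log records that the former actions run in exactly the order the old one-pass parser would have run them: both code_32 parts before code_31.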
Intermediate transformation of the parser:

  // Direct recursion:
  nonterm3:
    nonterm3 '+' nonterm3 { $$= new Node_31($1, $3); }
  | TERMINAL3 { $$= new Node_32; }
  ;

  // Indirect recursion between nonterm4 and nonterm5:
  nonterm4:
    TERMINAL4 nonterm5 { $$= new Node_40($2); }
  ;
  nonterm5:
    nonterm4 { $$= new Node_51($1); }
  | TERMINAL5 { $$= new Node_52; }
  ;

  // Upper level:
  caller3:
    nonterm3 { $1->contextualize(); }
  | nonterm4 { $1->contextualize(); }
  ;

Parse tree node classes (Node_31 etc.) have:

a) constructors that save subtrees as member variables for further use,
b) a virtual "contextualize" function that:
   * calls the "contextualize" functions of the previously saved subtrees
     (depth-first traversal) and
   * executes the former semantic actions of the corresponding rule.

This approach does the work with minimal overhead (MEM_ROOT allocations of
small parse tree node objects are cheap). The only problem is the
transformation of massive grammar parts with indirect recursion, such as
the "expr" rule: it would be nice to split huge transformations into
smaller reviewable and testable iterations.

The framework introduces a trick to transform such large interlinked
grammar parts step by step: the root class of the parse tree node
hierarchy has two kinds of virtual functions for parse tree
contextualization:

1. the regular "contextualize()" function for deferred execution of
   context-sensitive code;
2. the intermediate "contextualize_()" and "contextualizeN()" helper
   functions (usage details of contextualizeN are described later).

We use the intermediate contextualize_() functions to wrap existing action
code and to call the wrapped code immediately at the same old place.

Before the transformation:

  nonterm6:
    smth ... { code_60; }
  ;
  nonterm7:
    nonterm6 ... { code_70; }
  ;
  nonterm8:
    nonterm7 ... { code_80; }
  ;

Intermediate stage:

  nonterm6:
    smth ...
    {
      $$= new Node_60(...);
      $$->contextualize_(); // calls code_60
    }
  ;

The contextualize() functions in their intermediate state do nothing: at
the early stage we call them right after the contextualize_() wrappers:

  nonterm6:
    smth ...
    {
      $$= new Node_60(...);
      $$->contextualize_(); // calls code_60
      $$->contextualize();  // does nothing: just a placeholder
    }
  ;

then we lift the contextualize() calls up (this is safe since they
actually do nothing) but keep contextualize_() at the old place:

  nonterm6:
    smth ...
    {
      $$= new Node_60(...);
      $$->contextualize_(); // calls code_60
    }
  ;
  nonterm7:
    nonterm6 ...
    {
      $1->contextualize(); // still does nothing, just a placeholder
      code_70;
    }
  ;
  ...

then wrap that lifted contextualize() call together with the code of the
upper semantic actions:

  nonterm6:
    smth ...
    {
      $$= new Node_60(...);
      $$->contextualize_(); // calls code_60
    }
  ;
  nonterm7:
    nonterm6 ...
    {
      $$= new Node_70($1, ...);
      $$->contextualize_(); // calls $1->contextualize() and code_70
      $$->contextualize();  // does nothing, just another placeholder
    }
  ;
  ...

... and move the newly created placeholder call up too:

  ...
  nonterm7:
    nonterm6 ...
    {
      $$= new Node_70($1, ...);
      $$->contextualize_(); // calls $1->contextualize() and code_70
    }
  ;
  nonterm8:
    nonterm7 ...
    {
      $1->contextualize(); // new place for the placeholder
      code_80;
    }
  ;

... and so on.

If there is a loop in a rule's RHS, we:

1. remember that rule and modify its action to return a Parse_tree_node
   object (we simply ignore the RHS's nonterminals at first);
2. when all of the nonterminals in the RHS have been converted to parse
   tree nodes, pass them as parameters to the parse tree node object's
   constructor.
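The transitional two-function protocol can be sketched as follows. This is a hypothetical, simplified illustration (Node_60_sketch and reduce_nonterm6 are invented names; the real functions take a THD and use DBUG machinery): contextualize_() wraps the former action code and still runs in place at the reduce, while contextualize() is a do-nothing placeholder that callers can safely lift upward.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the intermediate refactoring stage.
struct Node_60_sketch {
  bool contextualized = false;
  bool transitional = false;
  std::string *log;

  explicit Node_60_sketch(std::string *l) : log(l) {}

  bool contextualize_() {  // wraps the former code_60, runs immediately
    assert(!contextualized && !transitional);
    transitional = true;
    contextualized = true;
    *log += "code_60;";    // old context-sensitive code still runs here
    return false;          // false == success (server convention)
  }
  bool contextualize() { return false; }  // placeholder: does nothing yet
};

std::string reduce_nonterm6() {
  std::string log;
  Node_60_sketch n(&log);  // 1. context-independent construction
  n.contextualize_();      // 2. former action code, executed in place
  n.contextualize();       // 3. placeholder, to be lifted to callers
  return log;
}
```

Because step 3 is a no-op, moving it into a caller rule cannot change behavior, which is what makes each refactoring step individually safe.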
Before the transformation:

  nonterm9:
    nonterm10 { code_90; }
  ;
  nonterm10:
    FOO nonterm9 { code_101; }
  | BAR { code_102; }
  ;

Step 1:

  // Decorate "nonterm9" as a parse tree node, but remember that its
  // transformation is incomplete:
  nonterm9:
    nonterm10
    {
      $$= new Node_90;      // don't care about nonterm10 in RHS: TBD later
      $$->contextualize_(); // code_90; TODO: call $1->contextualize()
      $$->contextualize();  // does nothing, lift this up on the next step
    }
  ;
  nonterm10:
    FOO nonterm9 { code_101; }
  | BAR { code_102; }
  ;
  caller11:
    nonterm9 { code_110; }
  ;

Step 2:

  // Lift the contextualize() call up from the "nonterm9" RHS to the
  // "calling" rules:
  nonterm9:
    nonterm10
    {
      $$= new Node_90;      // don't care about nonterm10 in RHS: TBD later
      $$->contextualize_(); // code_90; TODO: call $1->contextualize()
    }
  ;
  nonterm10:
    FOO nonterm9 { $2->contextualize(); code_101; }
  | BAR { code_102; }
  ;
  caller11:
    nonterm9 { $1->contextualize(); code_110; }
  ;

Step 3:

  // Transform the "nonterm10" rule right hand sides to allocate parse
  // tree nodes:
  nonterm9:
    nonterm10
    {
      $$= new Node_90;      // don't care about nonterm10 in RHS: TBD later
      $$->contextualize_(); // code_90; TODO: call $1->contextualize()
    }
  ;
  nonterm10:
    FOO nonterm9
    {
      $$= new Node_101($2);
      $$->contextualize_(); // calls $2->contextualize() and code_101
      $$->contextualize();  // does nothing, lift this up on the next step
    }
  | BAR
    {
      $$= new Node_102;
      $$->contextualize_(); // calls code_102
      $$->contextualize();  // does nothing, lift this up on the next step
    }
  ;
  caller11:
    nonterm9 { $1->contextualize(); code_110; }
  ;

Step 4:

  // Lift the contextualize() calls from nonterm10 back to its "caller"
  // nonterm9 rule:
  nonterm9:
    nonterm10
    {
      $$= new Node_90($1);
      $$->contextualize_(); // calls $1->contextualize() and code_90
    }
  ;
  nonterm10:
    FOO nonterm9
    {
      $$= new Node_101($2);
      $$->contextualize_(); // calls code_101 and $2->contextualize()
    }
  | BAR
    {
      $$= new Node_102;
      $$->contextualize_(); // calls code_102
    }
  ;
  caller11:
    nonterm9 { $1->contextualize(); code_110; }
  ;

Final step:

  // a. Remove all calls to contextualize_() in semantic actions;
  // b. Replace all implementations of contextualize() with the bodies of
  //    the corresponding contextualize_() functions.
  nonterm9:
    nonterm10 { $$= new Node_90($1); }
  ;
  nonterm10:
    FOO nonterm9 { $$= new Node_101($2); }
  | BAR { $$= new Node_102; }
  ;
  caller11:
    nonterm9 { $1->contextualize(); code_110; }
  ;

At this step the part of the grammar behind the nonterm9 rule is
completely context-free, and the caller11 rule can do a deferred
contextualization. Of course, this approach works for very long indirect
recursion paths like the "expr" rule too.

Transformation of mid-rule semantic actions
-------------------------------------------

The last remaining complexity in the grammar is the presence of semantic
actions at the beginning and in the middle of RHSes:

  nonterm12:
    { code_120; } foo { code_121; } bar { code_122; }
  ;

In the common case we can't just merge the code parts above into a single
contextualize_() call before making "foo" and "bar" truly context-free.
So, if "foo" or "bar" refers to "nonterm12" in some direct or indirect
way, we have to defer the transformation of "nonterm12" until the complete
transformation of the "nonterm12 <--> foo" and/or "nonterm12 <--> bar"
loops. If there are a few interlinked cases like that, we have to defer
the transformation of their multi-action rules too and then finally
transform all of them at once. To make sure that we don't forget anything,
we transform such rules with the help of additional contextualizeN()
functions:

  %union {
    ...
    class Node_120 *node_120;
  }
  ...
  nonterm12:
    {
      $$= new Node_120;     // save the result into the intermediate $1
      $$->contextualize1(); // calls code_120
    }
    foo
    {
      // save "foo" ($2) like a constructor usually does and call code_121:
      $1->contextualize2($2);
    }
    bar
    {
      $1->contextualize3($4); // save "bar" ($4) and call code_122
      $$= $1;
      $$->contextualize();    // does nothing, just a placeholder
    }
  ;

Right before the final step of the cyclic grammar part transformation we
modify every involved multi-action rule this way:

1. merge all contextualizeN() functions into a single contextualize_(),
2. move the parameter passing from that contextualize_() to the
   constructor,

so rules get the desired single-action look:

  nonterm12:
    foo bar
    {
      $$= new Node_120($1, $2);
      $$->contextualize_(); // calls code_120, code_121 and code_122
    }
  ;

Uncovered topics
================

1. Error messaging (see the LLD for details).

2. Items: the framework helps to reuse Item objects as parse tree nodes
   where applicable (see the LLD for details).

3. Typed nonterminals. All the examples above show the transformation of
   untyped rules (the type of the RHS is NONE). However, many nonterminals
   besides Items have typed values. To return such a value to "callers" we
   just embed a typed public member variable into the corresponding parse
   tree node class. We fill that variable with a value in the
   contextualize_/contextualize function.

   Before the transformation:

     %union {
       ...
       Some_type some_type;
     }
     %type <some_type> some_rule

     some_rule:
       ... { $$= some_value; }
     ;
     caller:
       some_rule ... { do something with some_value in $1; }
     ;

   After the transformation:

     %union {
       ...
       class Some_node *some_node;
     }
     %type <some_node> some_rule

     some_rule:
       ... { $$= new Some_node(...); }
     ;
     caller:
       some_rule ...
       {
         $1->contextualize();
         do something with some_value in $1->value;
       }
     ;
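The typed-nonterminal pattern from item 3 above can be sketched in a few lines. This is a hypothetical illustration (Some_node_sketch and the value 42 are invented; the real contextualize functions take a THD): the value that used to travel on the Bison value stack becomes a public member, filled in only at contextualization time.

```cpp
// Hypothetical sketch of a typed nonterminal after refactoring.
struct Some_node_sketch {
  int value = 0;          // the former typed semantic value of the rule

  bool contextualize() {  // replaces the old "$$= some_value;" action
    value = 42;           // stand-in for the context-sensitive computation
    return false;         // false == success (server convention)
  }
};
```

The caller must read $1->value only after $1->contextualize() has run, which mirrors the ordering constraint in the grammar example above.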
Parse tree hierarchy root class
===============================

All parse tree node classes are derived from the Parse_tree_node class:

  class Parse_tree_node
  {
    ...
  private:
#ifndef DBUG_OFF
    /*
      False right after the node allocation. The
      contextualize/contextualize_ function turns it into true.
    */
    bool contextualized;
    /*
      Intermediate helper variable for transformation debugging. True after
      the contextualize_() function call, i.e. "true" means that the
      grammar rule that allocates this node is in the intermediate state of
      the refactoring and is still context-sensitive.
    */
    bool transitional; // TODO: remove after the parser refactoring
#endif//DBUG_OFF

  public:
    /*
      Memory allocation operators are overloaded to use a mandatory
      MEM_ROOT parameter for cheap thread-local allocation.

      Note: we don't process memory allocation errors in refactored
      semantic actions: we defer OOM error processing like other parse
      errors and process them all at the contextualization stage of the
      resulting parse tree.
    */
    static void *operator new(size_t size, MEM_ROOT *mem_root) throw ()
    { return alloc_root(mem_root, size); }
    static void operator delete(void *ptr, size_t size)
    { TRASH(ptr, size); }
    static void operator delete(void *ptr, MEM_ROOT *mem_root) {}

  protected:
    Parse_tree_node()
    {
#ifndef DBUG_OFF
      contextualized= false;
      transitional= false;
#endif//DBUG_OFF
    }

  public:
    ...
#ifndef DBUG_OFF
    /* True if a contextualize/contextualize_ function has been called: */
    bool is_contextualized() const { return contextualized; }
#endif//DBUG_OFF

    /*
      Contextualization function to override in subclasses.

      This class' contextualize() function just checks the call stack for
      overflow, since the contextualization algorithm requires a recursive
      depth-first traversal of the parse tree. All overriding
      "contextualize" functions call this function of the hierarchy root
      (directly or indirectly).
    */
    virtual bool contextualize(THD *thd);

    /**
      Intermediate version of the contextualize() function

      This function is intended to resolve parser grammar loops.

      During the step-by-step refactoring of the parser grammar we wrap
      each context-sensitive semantic action with 3 calls:
      1. a context-independent Parse_tree_node() constructor call,
      2. a contextualize_() function call to evaluate all
         context-sensitive things from the former context-sensitive
         semantic action code,
      3. a call of the dummy contextualize() function.
      Then we lift the contextualize() function call into outer grammar
      rules but keep the contextualize_() function call untouched.

      When all loops in the grammar rules are resolved (i.e. transformed
      as described above) we:
      a. remove all contextualize_() function calls and
      b. rename all contextualize_() function definitions to
         contextualize() function definitions.

      Note: it's not necessary to transform the whole grammar and remove
      these function calls in one pass: it's possible to transform the
      grammar statement by statement in the way described above.
    */
    virtual bool contextualize_(THD *thd)
    {
#ifndef DBUG_OFF
      DBUG_ASSERT(!contextualized && !transitional);
      transitional= true;
      contextualized= true;
#endif//DBUG_OFF
      return false;
    }

    /**
      my_parse_error() function replacement for deferred reporting of
      parse errors

      @param thd current THD
    */
    void error(THD *thd) const;
  };

Items
=====

We don't want unnecessary CPU/memory overhead, so we reuse Item objects as
parse tree node objects where applicable. So, we derive the Item hierarchy
root class from the Parse_tree_node class:

  class Item : public Parse_tree_node
  {
  protected:
    /*
      The is_parser_item field distinguishes the way the Item object was
      allocated. If we allocate an Item class object (actually, one of its
      successors) directly from the parser grammar, we need to use the
      context-independent Item(POS) constructor (see below). That
      constructor sets the is_parser_item field to true. Other
      (context-sensitive) Item constructors must set the is_parser_item
      field to false.

      In many Item successor classes, context-sensitive constructors share
      code with the overridden itemize() function, so if we call such a
      constructor and then apply the itemize() function, we could
      unexpectedly execute that shared code twice. With the help of the
      is_parser_item value we can recognize non-parser-allocated Item
      objects in the itemize() function and return immediately with
      success.
    */
    const bool is_parser_item; // true if allocated directly by the parser

  public:
    ...
    /*
      Traditional constructor for Items: don't call it from parse tree
      nodes!

      Note: this constructor sets the "contextualized" flag to true.
    */
    Item();

    /*
      Extra constructor for parse tree node-aware Item classes.

      This constructor is a copy of the previous constructor excluding the
      context-sensitive parts (free_list processing etc.): those parts go
      to the contextualize_/contextualize functions. We do the same thing
      when we convert other Items to parse tree nodes: constructors do
      mostly nothing but save input parameters; the
      contextualize/contextualize_ functions evaluate the rest of the
      former constructor's work.
    */
    explicit Item(POS);
    ...

    /*
      The grammar rules that return Items in the old grammar still return
      Items in the refactored grammar. Some of these new Items are regular
      Item objects (we allocate such Items only once in the grammar where
      applicable). But some new Items are intermediate parse tree node
      objects, since we are unable to decide which exact Item to allocate
      during the context-free parsing. So we defer the allocation of some
      Item objects until the contextualization stage.

      The common contextualize/contextualize_ function call format is not
      suitable for such intermediate Items: sometimes we have to
      substitute them with other Item objects. Thus, we hide the common
      contextualize/contextualize_ functions and redefine them as
      itemize/itemize_ functions with an extra parameter, like the
      fix_fields function has. So, for parse tree Items we call these
      functions this way:

        item->itemize(thd, &item);
    */
  private:
    /*
      Hide the contextualize*() functions: call/override
      itemize()/itemize_() in the Item class tree instead.
    */
    virtual bool contextualize(THD *thd) { DBUG_ASSERT(0); return true; }
    virtual bool contextualize_(THD *thd) { DBUG_ASSERT(0); return true; }

  protected:
    /**
      Helper function to skip itemize()/itemize_() for
      non-grammar-allocated items

      @param [out] res pointer to "this"

      @retval true  can skip itemize()
      @retval false can't skip: the item is allocated directly by the
                    parser
    */
    bool skip_itemize(Item **res)
    {
      *res= this;
      return !is_parser_item;
    }

  public:
    /*
      The same as contextualize()/contextualize_() but with an additional
      parameter.

      This function finalizes the construction of Item objects (see the
      Item(POS) constructor): we can access/change parser contexts from
      the itemize() function.

      @param thd        current THD
      @param [out] res  pointer to "this" or to a newly allocated
                        replacement object to use in the Item tree instead

      @retval false success
      @retval true  syntax/OOM/etc. error
    */
    virtual bool itemize(THD *thd, Item **res)
    {
      if (skip_itemize(res))
        return false;
      return super::contextualize(thd);
    }

    /*
      Intermediate version of the itemize() function

      During the step-by-step refactoring of the parser grammar we replace
      each context-sensitive Item() constructor call with 3 calls:
      1. a context-independent Item(POS) constructor call,
      2. an itemize_() function call to evaluate all context-sensitive
         things from the former context-sensitive Item() constructor,
      3. a call of the dummy itemize() function.
      Then we lift the itemize() function call into outer grammar rules
      but keep the itemize_() function call untouched.

      When all loops in the Item grammar rules are resolved (i.e.
      transformed as described above) we:
      a. remove all itemize_() function calls and
      b. rename all itemize_() function definitions to itemize() function
         definitions.
    */
    virtual bool itemize_(THD *thd, Item **res)
    {
      if (skip_itemize(res))
        return false;
      ...
    }
    ...
  };

NEW_PTN macro
=============

The NEW_PTN macro is just a wrapper around the "new (YYTHD->mem_root)"
operator.

Lists
=====

Left-recursive lists are very common in the grammar:

  list:
    smth { head_code; }
  | list ',' smth { tail_code; }
  ;

so we introduce some helper classes and macros for their refactoring.

Item lists
----------

Typically an Item list has a uniform declaration:

  %union {
    ...
    Item *item;
    List<Item> *item_list;
  }
  %type <item> smth
  %type <item_list> list

  list:
    smth
    {
      $$= new List<Item>;
      $$->push_back($1);
      check for OOM errors;
    }
  | list ',' smth
    {
      $1->push_back($3);
      check for OOM errors;
      $$= $1;
    }
  ;

It seems reasonable not to introduce recursive Parse_tree_node wrappers
for each of the "list" rule branches: we can use List<Item> to save
parser-allocated Items and then itemize those Items in place. (After the
refactoring we are going to replace List<Item> lists with memory-efficient
Mem_root_array objects.)

After the refactoring that list looks this way:

  %union {
    ...
    Item *item;
    PT_item_list *item_list;
  }
  %type <item> smth
  %type <item_list> list

  list:
    smth
    {
      $$= NEW_PTN PT_item_list;
      if ($$ == NULL || $$->push_back($1))
        MYSQL_YYABORT;
    }
  | list ',' smth
    {
      if ($1 == NULL || $1->push_back($3))
        MYSQL_YYABORT;
      $$= $1;
    }
  ;
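The list wrapper above can be sketched in a few lines of C++. This is a hypothetical illustration (PT_item_list_sketch uses std::vector in place of the server's List<Item>/Mem_root_array, and ItemStub stands in for Item); only the push_back convention from the grammar above is modeled: it returns true on error, so the actions can write "if (push_back(...)) MYSQL_YYABORT;".

```cpp
#include <vector>

struct ItemStub { int v; };  // stand-in for the server's Item class

// Hypothetical sketch of a PT_item_list-like wrapper: the head branch of
// the left-recursive rule allocates one wrapper, and every tail branch
// appends to the same object, so no per-element tree nodes are needed.
struct PT_item_list_sketch {
  std::vector<ItemStub *> value;

  // Returns true on error (here: a null element standing in for OOM),
  // matching the error convention used in the grammar actions above.
  bool push_back(ItemStub *item) {
    if (item == nullptr) return true;
    value.push_back(item);
    return false;
  }
};
```

Parsing "a, b, c" would reduce the head branch once and the tail branch twice, calling push_back three times on the same wrapper object.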
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.