MySQL Internals Manual  /  ...  /  Parser Structure

16.4.1 Parser Structure

Conceptually, there are many different layers involved during parsing:

  • Lexical analysis (making words or tokens from a character stream),

  • Syntactic analysis (making "sentences" or an abstract syntax tree from the tokens),

  • Semantic analysis (making sure these sentences do make sense),

  • Code generation (for compilers) or evaluation (for interpreters).

From the implementation point or view, many different concepts from different layers actually collide in the same code base, so that the actual code organization is as follows:

  • The lexical analysis is implemented in sql/, as when parsing regular statements.

  • Syntactic analysis, semantic analysis, and code generation -- all of them -- are done at once, during parsing of the code. From that perspective, the parser behaves as a single pass compiler. In other words, both the code and the symbol table for the final result are generated on the fly, interleaved with syntactic analysis.

This is both very efficient from a performance point of view, but difficult to understand, from a maintenance point of view.

Let's illustrate for example how the following SQL statement is parsed:


The corresponding part of the grammar in the parser for DECLARE CURSOR statements is the following (with annotated line numbers):

[ 1] sp_decl:
[ 2]   DECLARE_SYM ident CURSOR_SYM FOR_SYM sp_cursor_stmt
[ 3]   {
[ 4]     LEX *lex= Lex;
[ 5]     sp_head *sp= lex->sphead;
[ 6]     sp_pcontext *ctx= lex->spcont;
[ 7]     uint offp;
[ 8]     sp_instr_cpush *i;
[ 9]
[10]     if (ctx->find_cursor(&$2, &offp, TRUE))
[11]     {
[12]       my_error(ER_SP_DUP_CURS, MYF(0), $2.str);
[13]       delete $5;
[14]       MYSQL_YYABORT;
[15]     }
[16]     i= new sp_instr_cpush(sp->instructions(), ctx, $5,
[17]                           ctx->current_cursor_count());
[18]     sp->add_instr(i);
[19]     ctx->push_cursor(&$2);
[20]     $$.vars= $$.conds= $$.hndlrs= 0;
[21]     $$.curs= 1;
[22]   }
[23] ; 

The lines [1], [2] and [23] are bison code that express the structure of the grammar. These lines belong to the syntactic parsing realm.

The lines [3] and [22] are bison delimiters for the associated action to execute, when parsing of the syntax succeeds. Everything between lines [3] and [22] is C++ code, executed when the parser finds a syntactically correct DECLARE CURSOR statement.

The lines [4] to [8] could be considered syntactic parsing: what the code does is find what is the current Stored Program being parsed, find the associated part of the syntax tree under construction (sp_head), and find the associated current context in the symbol table (sp_pcontext).

Note that there is some black magic here: since we are still currently parsing the content of a Stored Program (the DECLARE CURSOR statement), the final syntax tree for the Stored Program (sp_head) is not supposed to exist yet. The reason the sp_head object is already available is that the actions in the CREATE PROCEDURE, CREATE FUNCTION, CREATE TRIGGER, or CREATE EVENT are implemented as a descendant parser (it created an empty sp_head object first, filling the content later). Mixing code that way (descendant actions with ascendant parsing) is extremely sensitive to changes.

The line [10] is a semantic check. The statement might be syntactically correct (it parsed), but to be semantically correct, the name or the cursor must be unique in the symbol table.

Line [12] is reporting a semantic error back to the client (duplicate cursor). The code at line [14] forces the syntactic parser (bison) to abort.

By line [16], we have verified that the code is syntactically valid, and semantically valid: it's now time for code generation, implemented by creating a new sp_instr_cpush to represent the cursor in the compiled code. Note that variable allocation is done on the fly, by looking up the current cursor count in the symbol table (sp_pcontext::current_cursor_count()).

Line [18] adds the generated code to the object representing the stored program (code generation).

Line [19] maintains the symbol table (semantic parsing) by adding the new cursor in the current local context.

Lines [20] and [21] return to bison a fragment of a fake syntax tree, indicating that one cursor was found.

By looking at the complete implementation of this action in bison, one should note that the target code was generated, the symbol table for the Stored Program was looked up and updated, while at no point in time a syntax node was even created. Note that the sp_instr_cpush object should really be considered generated code: the fact that there is a one-to-one correspondence with the syntax is incidental.

User Comments
Sign Up Login You must be logged in to post a comment.