There is no “Stored Program Parser” as such, there is
only one parser in the SQL layer in the server. This parser is
capable of understanding every SQL statement, including statements
related to Stored Programs. The parser is implemented as an
ascendant parser, using bison. The source code is located in the
file sql/sql_yacc.yy.
The parts of the parser dedicated more specially to Stored Programs are starting at the following rules:
CREATE PROCEDURE : see rule
sp_tail,
CREATE FUNCTION : see rule
sp_tail,
CREATE TRIGGER : see rule
trigger_tail,
CREATE EVENT : see rule
event_tail.
In every case, the parser reads the SQL text stream that
represents the code as input, and creates an internal
representation of the Stored Program as output, with one C++
object of type sp_head. A limiting consequence
of this approach is that a stored program does not support
nesting: it is impossible to embed one CREATE
PROCEDURE into another, since the parser currently may
only support one sp_head object at a time.
Conceptually, there are many different layers involved during parsing:
Lexical analysis (making words or tokens from a character stream),
Syntactic analysis (making "sentences" or an abstract syntax tree from the tokens),
Semantic analysis (making sure these sentences do make sense),
Code generation (for compilers) or evaluation (for interpreters).
From the implementation point or view, many different concepts from different layers actually collide in the same code base, so that the actual code organization is as follows:
The lexical analysis is implemented in
sql/sql_lex.cc, as when parsing regular
statements.
Syntactic analysis, semantic analysis, and code generation -- all of them -- are done at once, during parsing of the code. From that perspective, the parser behaves as a single pass compiler. In other words, both the code and the symbol table for the final result are generated on the fly, interleaved with syntactic analysis.
This is both very efficient from a performance point of view, but difficult to understand, from a maintenance point of view.
Let's illustrate for example how the following SQL statement is parsed:
DECLARE my_cursor CURSOR FOR SELECT col1 FROM t1;
The corresponding part of the grammar in the parser for DECLARE CURSOR statements is the following (with annotated line numbers):
[ 1] sp_decl:
[ 2] DECLARE_SYM ident CURSOR_SYM FOR_SYM sp_cursor_stmt
[ 3] {
[ 4] LEX *lex= Lex;
[ 5] sp_head *sp= lex->sphead;
[ 6] sp_pcontext *ctx= lex->spcont;
[ 7] uint offp;
[ 8] sp_instr_cpush *i;
[ 9]
[10] if (ctx->find_cursor(&$2, &offp, TRUE))
[11] {
[12] my_error(ER_SP_DUP_CURS, MYF(0), $2.str);
[13] delete $5;
[14] MYSQL_YYABORT;
[15] }
[16] i= new sp_instr_cpush(sp->instructions(), ctx, $5,
[17] ctx->current_cursor_count());
[18] sp->add_instr(i);
[19] ctx->push_cursor(&$2);
[20] $$.vars= $$.conds= $$.hndlrs= 0;
[21] $$.curs= 1;
[22] }
[23] ;
The lines [1], [2] and [23] are bison code that express the structure of the grammar. These lines belong to the syntactic parsing realm.
The lines [3] and [22] are bison delimiters for the associated
action to execute, when parsing of the syntax succeeds.
Everything between lines [3] and [22] is C++ code, executed when
the parser finds a syntactically correct DECLARE
CURSOR statement.
The lines [4] to [8] could be considered syntactic parsing: what
the code does is find what is the current Stored Program being
parsed, find the associated part of the syntax tree under
construction (sp_head), and find the
associated current context in the symbol table
(sp_pcontext).
Note that there is some black magic here: since we are still
currently parsing the content of a Stored
Program (the DECLARE CURSOR statement), the
final “syntax” tree for the Stored Program
(sp_head) is not supposed to exist yet. The
reason the sp_head object is already
available is that the actions in the CREATE
PROCEDURE, CREATE FUNCTION,
CREATE TRIGGER, or CREATE
EVENT are implemented as a
descendant parser (it created an empty
sp_head object first, filling the content
later). Mixing code that way (descendant actions with ascendant
parsing) is extremely sensitive to changes.
The line [10] is a semantic check. The statement might be syntactically correct (it parsed), but to be semantically correct, the name or the cursor must be unique in the symbol table.
Line [12] is reporting a semantic error back to the client (duplicate cursor). The code at line [14] forces the syntactic parser (bison) to abort.
By line [16], we have verified that the code is syntactically
valid, and semantically valid: it's now time for code
generation, implemented by creating a new
sp_instr_cpush to represent the cursor in the
compiled code. Note that variable allocation is done on the fly,
by looking up the current cursor count in the symbol table
(sp_pcontext::current_cursor_count()).
Line [18] adds the generated code to the object representing the stored program (code generation).
Line [19] maintains the symbol table (semantic parsing) by adding the new cursor in the current local context.
Lines [20] and [21] return to bison a fragment of a fake syntax tree, indicating that one cursor was found.
By looking at the complete implementation of this action in
bison, one should note that the target code was generated, the
symbol table for the Stored Program was looked up and updated,
while at no point in time a syntax node was even created. Note
that the sp_instr_cpush object should really
be considered generated code: the fact that there is a
one-to-one correspondence with the syntax is incidental.
All the code generated by the parser is emitted in a single pass. For example, consider the following SQL logic:
CREATE FUNCTION func_4(i int)
RETURNS CHAR(10)
BEGIN
DECLARE str CHAR(10);
CASE i
WHEN 1 THEN SET str="1";
WHEN 2 THEN SET str="2";
WHEN 3 THEN SET str="3";
ELSE SET str="unknown";
END CASE;
RETURN str;
END$$
The compiled program for this Stored Function is:
SHOW FUNCTION CODE func_4; Pos Instruction 0 set str@1 NULL 1 set_case_expr (12) 0 i@0 2 jump_if_not 5(12) (case_expr@0 = 1) 3 set str@1 _latin1'1' 4 jump 12 5 jump_if_not 8(12) (case_expr@0 = 2) 6 set str@1 _latin1'2' 7 jump 12 8 jump_if_not 11(12) (case_expr@0 = 3) 9 set str@1 _latin1'3' 10 jump 12 11 set str@1 _latin1'unknown' 12 freturn 254 str@1
Note the instruction at position 4: jump 12.
How can the compiler generate this instruction in a single pass,
when the destination (12) is not known yet ? This instruction is
a forward jump. What happens during code
generation is that, by the time the compiler has generated the
code for positions [0] to [11], the generated code looks like
this:
Pos Instruction 0 set str@1 NULL 1 set_case_expr ( ?? ) 0 i@0 2 jump_if_not 5( ?? ) (case_expr@0 = 1) 3 set str@1 _latin1'1' 4 jump ?? 5 jump_if_not 8( ?? ) (case_expr@0 = 2) 6 set str@1 _latin1'2' 7 jump ?? 8 jump_if_not 11( ?? ) (case_expr@0 = 3) 9 set str@1 _latin1'3' 10 jump ?? 11 set str@1 _latin1'unknown' ...
The final destination of the label for the END
CASE is not known yet, and the list of all the
instructions (1, 2, 4, 5, 7, 8 and 10) that need to point to
this unknown destination (represented as ??)
is maintained in a temporary structure used during code
generation only. This structure is called the context back patch
list.
When the destination label is finally resolved to a destination
(12), all the instructions pointing to that label, which have
been already generated (but with a bogus destination) are
back patched to point to the correct
location. See the comments marked BACKPATCH
in the code for more details.
As a side note, this generated code also shows that some
temporary variables can be generated implicitly, such as the
operand of the CASE expression, labeled
case_expr@0.
Numbering of case expressions in the symbol table uses a
different namespace than variables, so that
case_expr@0 and i@0 are
two different variables, even when both internally numbered
with offset zero.
