WL#5257: Create a first API for the Optimizer trace

Status: Complete

This is a sub-piece of WL#4800 "Optimizer trace/debugger".
It is only about defining the interface that the "optimizer trace" module will
expose to the rest of the Server sql/ code.
Once the interface is developed and approved, the Optimizer team will be able to
use it to add more tracing points into the optimizer code, which is WL#5842. The
present WL nevertheless already adds some tracing points as a proof of concept.

This task is about creating a first version of the API. Tracing is mainly
intended for the Optimizer team and the Bug Verification team. We want to be
free to change the shape and content of the trace at will, for an undefined but
probably long time, until we are satisfied with it, without caring about
backward compatibility or about users' applications which may read and parse the
trace. That is mainly because we are not sure up front what the best tracing is.

The prototype, originally by Sergey Petrunia, is located on our usual internal
central host in the tree mysql-next-mr-opt-backporting-wl4800.
Tracing is per-session (you can enable tracing for your session only, and you
can see only your own statements' traces). The trace is generated as a statement
executes, kept available for the user to read, and deleted when no longer needed
(see "offset" and "limit" further down for what defines this moment).

Trace creation is incremental: the trace is built piece by piece; we first put
into it information about constant table detection, then about ranges, etc.,
following the existing flow of the Optimizer code.
When the statement has finished executing, the trace is readable with
  select * from information_schema.OPTIMIZER_TRACE;

As an example, this query:
  SELECT (SELECT 1 FROM t6 WHERE d = c) AS RESULT FROM t5;
gives this trace:

  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          {
            "join_preparation": {
              "select#": 2,
              "steps": [
                {
                  "expanded_query": "/* select#2 */ select 1 from `test`.`t6`
where (`test`.`t6`.`d` = `test`.`t5`.`c`)"
                }
              ]
            }
          },
          {
            "expanded_query": "/* select#1 */ select (/* select#2 */ select 1
from `test`.`t6` where (`test`.`t6`.`d` = `test`.`t5`.`c`)) AS `RESULT` from
`test`.`t5`"
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          {
            "records_estimation": [
              {
                "database": "test",
                "table": "t5",
                "records": 1,
                "cost": 1,
                "table_type": "system"
              }
            ]
          },
          {
            "attaching_conditions_to_tables": {
              "original_condition": null,
              "attached_conditions_computation": [
              ],
              "attached_conditions_summary": [
              ]
            }
          },
          {
            "refine_plan": [
            ]
          }
        ]
      }
    },
    {
      "join_execution": {
        "select#": 1,
        "steps": [
          {
            "subselect_execution": {
              "select#": 2,
              "steps": [
                {
                  "join_optimization": {
                    "select#": 2,
                    "steps": [
                      {
                        "condition_processing": {
                          "condition": "WHERE",
                          "original_condition": "(`test`.`t6`.`d` = NULL)",
                          "steps": [
                            {
                              "transformation": "equality_propagation",
                              "resulting_condition": "multiple equal(NULL,
`test`.`t6`.`d`)"
                            },
                            {
                              "transformation": "constant_propagation",
                              "resulting_condition": "multiple equal(NULL,
`test`.`t6`.`d`)"
                            },
                            {
                              "transformation": "trivial_condition_removal",
                              "resulting_condition": "multiple equal(NULL,
`test`.`t6`.`d`)"
                            }
                          ]
                        }
                      },
                      {
                        "ref_optimizer_key_uses": [
                          {
                            "database": "test",
                            "table": "t6",
                            "field": "d",
                            "equals": "NULL",
                            "null_rejecting": true
                          }
                        ]
                      },
                      {
                        "records_estimation": [
                          {
                            "database": "test",
                            "table": "t6",
                            "range_analysis": {
                              "table_scan": {
                                "records": 2,
                                "cost": 4.5034
                              },
                              "potential_range_indices": [
                                {
                                  "index": "d",
                                  "usable": true,
                                  "key_parts": [
                                    "d"
                                  ]
                                }
                              ],
                              "best_covering_index_scan": {
                                "index": "d",
                                "cost": 1.4233,
                                "chosen": true
                              },
                              "setup_range_conditions": [
                                {
                                  "impossible_condition": {
                                    "cause": "comparison_with_null_always_false"
                                  }
                                }
                              ],
                              "impossible_range": true
                            },
                            "records": 0,
                            "cause": "impossible_where_condition"
                          }
                        ]
                      }
                    ],
                    "empty_result": {
                      "cause": "no matching row in const table"
                    }
                  }
                },
                {
                  "join_execution": {
                    "select#": 2,
                    "steps": [
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }
  ]
}

The format above is JSON. More details about JSON are in the Low Level Design;
in short, a JSON object is an unordered set of key/value pairs (like a Perl
associative array), and a JSON array is an ordered list of values (like a Perl
list). The trace starts with a top-level object which has one key/value pair:
the key is "steps", and the value is an array. Each element of this array is an
object with a single key: "join_preparation" for the first one,
"join_optimization" for the next, "join_execution" for the last, etc. Each such
object in turn contains sub-objects or arrays, and so on, down to the simplest
types, which are scalars (string, number, boolean).

Such a trace is filled only if the user has enabled it with
  SET SESSION OPTIMIZER_TRACE="enabled=on";
otherwise no trace is created, which saves CPU time and memory.
@@OPTIMIZER_TRACE is a "set of flags" variable (like @@OPTIMIZER_SWITCH): it has
one master flag to enable/disable tracing ("enabled=on|off") and additional
flags to fine-tune the trace (described below).
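
A minimal usage sketch, reusing the example tables from the trace above (the
quoted-string syntax is how set-of-flags variables are assigned):

  SET SESSION OPTIMIZER_TRACE="enabled=on";                 # enable tracing for this session
  SELECT (SELECT 1 FROM t6 WHERE d = c) AS RESULT FROM t5;  # the traced statement
  SELECT * FROM information_schema.OPTIMIZER_TRACE;         # read the trace
  SET SESSION OPTIMIZER_TRACE="enabled=off";                # disable tracing again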

The trace is not only about SELECT statements; a trace is generated for:
  SELECT,
  EXPLAIN SELECT,
  INSERT or REPLACE (with VALUES or SELECT),
  UPDATE/DELETE and their multi-table variants,
  SET (unless it manipulates @@optimizer_trace),
  DO,
  DECLARE/CASE/IF/RETURN (stored routines language elements).

QA's automated tests add the requirement that keys can contain only characters
in [a-zA-Z_-]. QA wants to automatically verify that a test is hitting a certain
optimization so they will inspect keys and values. Likely they will have to
parse the JSON output into an object of the test's native language and inspect
this object ("does the object have property X and is it set to true?"); Python
and Perl have modules to do such parsing.

Requirement from Support: EXPLAIN SELECT should suffice to produce the trace;
they may not always have the option to execute SELECT on the customers' machine,
especially if the query is slow.

We expect tracing to be rarely turned on, only when there is a DBA/user to look
at it, and only for a few sessions at a time.

Tracing can be compiled out with a "cmake" flag (like
cmake -DOPTIMIZER_TRACE=0). It is expected that release binaries will
have it compiled in. A consequence is that the most performance-critical common
case is: a release binary with tracing compiled in but not enabled at runtime.
That case must show as small a performance loss as possible compared to a
vanilla tree which does not have this WL.

We don't want to have to duplicate code like this:
  DBUG_PRINT("info",("cost %g",cost));
  OPT_TRACE(thd->opt_trace, add("cost",cost));
Thus, any optimizer trace operation, *even* if tracing is run-time disabled, has
an implicit DBUG_PRINT inside. This way, only the second line above is needed,
and several DBUG_PRINT() calls can be removed from the Optimizer code. A debug
binary will refuse to build if optimizer trace is not compiled in.

In case of a crash (segfault, assertion failure), the stack-trace printer will
print the optimizer trace to the error log, as a debugging aid.

As the trace grows, it needs more space, allocated with malloc() or realloc();
if such OS functions fail to allocate memory, the statement fails (this is a
constraint of the implementation).

New variables
=============

1)  --optimizer-trace=name 
                      Controls tracing of the Optimizer:
                      optimizer_trace=option=val[,option=val...], where option
                      is one of {enabled, end_marker, one_line} and val is one
                      of {on, off, default}

All flags are off by default.
1a) "enabled": turns trace on or off for the session.
1b) "end_marker":
- when "off", a key/value pair, where the value is an object, looks like:
"key": {
content of object
       }
- when "on", it looks like:
"key": {
content of object
       } /* key */
the interest is that when "content of object" spans many lines, it is hard to
see which object the "}" closes; repeating the key of the object being closed
helps. Note, however, that this output is not JSON-compliant.
1c) "one_line": when "off" the trace looks like the long example above; when
"on", no whitespace is put in the trace: no "\n", no indentation; this is not
readable by a human but perfectly readable by JSON parsing programs; the only
interest is to decrease memory usage and client/server transfer.
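
For instance, a sketch of setting these flags together (flag names as listed
above):

  SET optimizer_trace="enabled=on,end_marker=on";  # readable closing markers, but not JSON-compliant
  SET optimizer_trace="enabled=on,one_line=on";    # compact output, still valid JSON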

2)  --optimizer-trace-features=name 
                      Enables/disables tracing of selected features of the
                      Optimizer:
                      optimizer_trace_features=option=val[,option=val...],
                      where option is one of {greedy_search, range_optimizer,
                      dynamic_range, repeated_subselect} and val is one of {on,
                      off, default}

All flags are on by default.
2a) "greedy_search": when "off", the search for a best order of tables (you
know, the optimizer has to decide whether accessing t1 then t2 is better than
accessing t2 then t1) is not traced; this may avoid explosion of the trace for
many-table joins
2b) "range_optimizer": when "off", range optimizer is not traced
2c) "dynamic_range": when "off", the calls to the range optimizer during
execution (when EXPLAIN says "range checked for each record") are traced only
the first time
2d) "repeated_subselect": when "off", the multiple executions of a subquery are
traced only the first time.
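
For example, one might switch off the most verbose features while keeping the
rest at their default (a sketch using the flag names above):

  SET optimizer_trace_features="greedy_search=off,repeated_subselect=off";
  # greedy search and repeated subquery executions are no longer traced;
  # range_optimizer and dynamic_range remain "on"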

3)  --optimizer-trace-max-mem-size=# 
                      Maximum allowed cumulated size of stored optimizer traces

Default is 16kB.
A 20-table query like in BUG#50595 (file bug.test there) generates a trace
bigger than 4GB, which is unreasonable to keep in memory; the culprit is greedy
search. A configurable limit on the memory size of the trace is implemented,
controllable by a session variable (with a low default to protect the innocent
user). If the limit is exceeded, the trace is truncated, and a column
MISSING_BYTES_BEYOND_MAX_MEM_SIZE of the
INFORMATION_SCHEMA.OPTIMIZER_TRACE table shows how many bytes could not be
written to the trace (if there is no truncation, the column contains 0). A
smarter system may be implemented in WL#5297.
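
A sketch of raising the limit for a big query and then checking for truncation
(the 1MB value is only illustrative):

  SET optimizer_trace_max_mem_size=1048576;    # allow up to 1MB of trace for this session
  # ... run the big query here ...
  SELECT MISSING_BYTES_BEYOND_MAX_MEM_SIZE
  FROM information_schema.OPTIMIZER_TRACE;     # 0 means the trace was not truncated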

4)
  --optimizer-trace-limit=# 
                      Maximum number of shown optimizer traces
(default 1)
  --optimizer-trace-offset=# 
                      Offset of first optimizer trace to show; see manual
(default -1)

When we create a trace, the previous trace is automatically deleted.
So when executing a stored routine, only the trace of the last statement of the
execution will be visible: all the rest is lost.
To allow the user to request tracing for a certain statement inside a stored
routine (assuming the user cannot change the routine's code to insert SELECT *
FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE into it), we will support
SET optimizer_trace_offset=OFS, optimizer_trace_limit=LIM
- if OFS>=0: the OPTIMIZER_TRACE table should contain the traces of LIM
statements after the OFSth statement, counting from when tracing was enabled
(i.e. when SET optimizer_trace="enabled=on" was issued);
- if OFS<0: the OPTIMIZER_TRACE table should contain the traces of LIM
statements after the (-OFS)th statement, counting backwards from the SELECT on
OPTIMIZER_TRACE;
- statements reading OPTIMIZER_TRACE do not count;
- the default is OFS=-1, which boils down to showing the last executed
statement's trace.
The memory limit on the optimizer trace's size applies to the concatenation of
all traces which the implementation needs to keep in memory:
- if OFS>=0, that means LIM traces;
- if OFS<0, that means (-OFS) traces, because, even though only [-OFS,-OFS+LIM[
is shown in the table, [-OFS+LIM+1,last] also needs to be in memory, as it will
be shown in the table after executing a few more statements ("the window
slides").
If the memory limit is exceeded, the last traces won't be recorded, and the
MISSING_BYTES_BEYOND_MAX_MEM_SIZE column of
INFORMATION_SCHEMA.OPTIMIZER_TRACE will tell the count of missing bytes (this
limit is cumulative over all remembered traces). If OFS>=0, tracing will stop.
If OFS<0, the next statement will cause the oldest kept trace to be purged,
which will make room.
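
As an illustration of the offset/limit semantics above, a sketch for looking at
the trace of the third-from-last statement executed by a stored procedure
(proc_name is a hypothetical routine which runs several traceable statements):

  SET optimizer_trace="enabled=on";
  SET optimizer_trace_offset=-3, optimizer_trace_limit=1;
  CALL proc_name();                                    # hypothetical stored procedure
  SELECT * FROM information_schema.OPTIMIZER_TRACE;    # one row: the 3rd-from-last statement's trace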

Security
========

When tracing is enabled, additional security checks are done; docs team: ping me
and I'll explain (it's a bit complicated).

Columns in INFORMATION_SCHEMA.OPTIMIZER_TRACE
=============================================

"QUERY" (string): the verbatim text of the query
"TRACE" (string): trace of this query
"MISSING_BYTES_BEYOND_MAX_MEM_SIZE": how many bytes of trace have been thrown
away because the optimizer-trace-max-mem-size limit was reached.
"INSUFFICIENT_PRIVILEGES": is 1 if the trace cannot be shown due to insufficient
privileges for the user.

Misc points for documentation
=============================

This should be mentioned in the doc of the feature:
- if people want to save their trace to a file they should use
SELECT INTO DUMPFILE instead of SELECT INTO OUTFILE (OUTFILE escapes '\n',
making the trace non-JSON compliant).
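
For example, a sketch of dumping the trace to a file (the path is only
illustrative; the TRACE column is described above):

  SELECT TRACE INTO DUMPFILE '/tmp/trace.json'
  FROM information_schema.OPTIMIZER_TRACE;
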
This should be mentioned in the manual:
- EXPLAIN EXTENDED, when it prints the query as a warning, now prints the number
of each query "piece" in the form /* select#X */; like:
"/* select#1 */ select (/* select#2 */ select 1 from `test`.`t6` where
(`test`.`t6`.`d` = `test`.`t5`.`c`)) AS `RESULT` from `test`.`t5`"
These select#X correspond to the "id" column in EXPLAIN output, so the printout
above can serve to decode this "id" column, which has so far only been described
in the manual as "The SELECT identifier. This is the sequential number of the
SELECT within the query." in
http://dev.mysql.com/doc/refman/5.5/en/explain-output.html (when there are
nested subqueries and unions, it is not trivial to know how the parser
progresses, and thus not trivial to know what select number 5 corresponds to; in
the printout it is obvious). A concrete example of obtaining this printout is
given at the end of this section.
- We must note clearly that traces are subject to change, and that we change
them frequently as we understand more of the optimizer code and refactor it.
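
As mentioned in the EXPLAIN EXTENDED note above, the /* select#X */ printout can
be obtained like this (using the query from the beginning of this WL):

  EXPLAIN EXTENDED SELECT (SELECT 1 FROM t6 WHERE d = c) AS RESULT FROM t5;
  SHOW WARNINGS;   # the Note-level warning contains the query rewritten with /* select#X */ markers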

Future
======
More ideas are pushed to WL#5297 which will happen later.
All the Low-Level Design is available, as Doxygen comments in file
sql/opt_trace.h of the tree, or much more readable in autogenerated HTML at:
[usual address of our intranet]/~gbichot/mysql-wl4800/html/OPTIMIZER_TRACE.html
The above is the main page which has an introduction and then links to each
class which has Doxygen comments.
A unit test has been added, to check that valid usage of the API produces
expected JSON output, that invalid usage produces invalid JSON, and to test some
important API functions.