WL#4168: Implement EXPLAIN ANALYZE

Affects: Server-8.0   —   Status: Complete

This task is to implement EXPLAIN ANALYZE. The command should run the query and
then produce EXPLAIN output with additional information about how the optimizer's
expectations matched the actual execution.
F1. EXPLAIN ANALYZE should be able to time any query that runs in the iterator 
executor.

F2. The measured row counts should be correct at every step; in particular, the 
top iterator's measured row count should match the actual number of returned 
rows.
F3. EXPLAIN ANALYZE FOR CONNECTION should give an error message.

NF1. Performance should be unaffected for queries not using EXPLAIN ANALYZE.

We introduce the new SQL syntax:

  EXPLAIN ANALYZE <select statement>

This automatically selects FORMAT=tree, and also executes the query (the 
query's result rows are discarded rather than sent to the user). EXPLAIN 
ANALYZE FOR CONNECTION is not supported.

Building on the new iterator executor, it is fairly simple to implement timing 
iterators on top of the real iterators, giving precise 
information about each call.

Specifically, for each iterator, we can implement a templated TimingIterator 
that wraps each Init() and Read() call with clock calls. (These 
become non-virtual calls, since all the iterator implementation classes are 
final. In some cases, they could be inlined.) This naturally has 
some overhead, but it seems to be only around 5–6% for an instrumented query 
(and entirely free for a non-instrumented query).
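
As an illustration only (a minimal sketch, not the actual server code), such a
wrapper could look roughly like this; the simplified RowIterator interface and
the member names here are assumptions made for the example:

  #include <chrono>
  #include <cstdint>
  #include <utility>

  // Simplified stand-in for the executor's row iterator interface
  // (assumption; the real interface has more members).
  class RowIterator {
   public:
    virtual ~RowIterator() = default;
    virtual bool Init() = 0;  // Returns true on error.
    virtual int Read() = 0;   // 0 = row available, -1 = EOF, 1 = error.
  };

  // Wraps a concrete iterator class and times every Init() and Read() call.
  // Since RealIterator is a concrete class, m_iterator.Init() and
  // m_iterator.Read() are non-virtual calls and may be inlined.
  template <class RealIterator>
  class TimingIterator final : public RowIterator {
   public:
    template <class... Args>
    explicit TimingIterator(Args &&... args)
        : m_iterator(std::forward<Args>(args)...) {}

    bool Init() override {
      ++m_num_init_calls;
      const auto start = std::chrono::steady_clock::now();
      const bool err = m_iterator.Init();
      m_time_spent += std::chrono::steady_clock::now() - start;
      m_first_read_pending = true;
      return err;
    }

    int Read() override {
      const auto start = std::chrono::steady_clock::now();
      const int result = m_iterator.Read();
      m_time_spent += std::chrono::steady_clock::now() - start;
      if (m_first_read_pending) {
        // Cumulative time up to and including the first Read() of a loop;
        // this is the basis for the reported time to first row.
        m_time_to_first_row = m_time_spent;
        m_first_read_pending = false;
      }
      if (result == 0) ++m_num_rows;  // A row was produced.
      return result;
    }

    // Figures later reported as "(actual time=.. rows=.. loops=..)".
    uint64_t num_rows() const { return m_num_rows; }
    uint64_t num_loops() const { return m_num_init_calls; }

   private:
    RealIterator m_iterator;
    std::chrono::steady_clock::duration m_time_spent{};
    std::chrono::steady_clock::duration m_time_to_first_row{};
    uint64_t m_num_rows = 0;
    uint64_t m_num_init_calls = 0;
    bool m_first_read_pending = false;
  };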

For each iterator, we will present the following information:

 - The time to return the first row (Init + first Read), i.e., the startup cost 
for each iteration, in milliseconds. For most iterators, this is 
nearly zero, but for e.g. materialize or sort with a big tree below them, it 
would be much bigger.
 - The time to return all rows (Init + all Read calls). This is equivalent to 
our current idea of "cost".
 - The number of rows returned by this iterator.
 - The number of loops (the number of Init calls). Multiplying this by the 
per-loop time gives the total amount of wall clock time contributed by this 
iterator; see the sketch after this list. (Our costs compute the pre-multiplied 
value, but this doesn't make all that much sense if the join method is not 
nested loop, where a lot of the cost can be in the initialization.)
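
To make the mapping to the printed output concrete, here is an illustrative
sketch (hypothetical, not taken from the server source) of a container for
these four figures and a formatter for the "(actual ...)" part of a line; the
"(never executed)" case corresponds to an iterator whose Init was never called:

  #include <cstdint>
  #include <cstdio>
  #include <string>

  // Hypothetical container for the four measured figures described above.
  struct IteratorMeasurements {
    double first_row_ms = 0.0;  // Init + first Read, per loop
    double all_rows_ms = 0.0;   // Init + all Read calls, per loop
    uint64_t rows = 0;          // rows returned
    uint64_t loops = 0;         // number of Init calls
  };

  // Formats the "(actual ...)" part of one EXPLAIN ANALYZE line. The total
  // wall-clock time attributed to the iterator is all_rows_ms * loops.
  std::string FormatActual(const IteratorMeasurements &m) {
    if (m.loops == 0) return "(never executed)";
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "(actual time=%.3f..%.3f rows=%llu loops=%llu)",
                  m.first_row_ms, m.all_rows_ms,
                  static_cast<unsigned long long>(m.rows),
                  static_cast<unsigned long long>(m.loops));
    return buf;
  }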

In addition, we would like to present the estimated costs and row counts
wherever possible; comparing these to the real measurements will allow the user
to get an idea of why the given plan was chosen whenever it's not obvious (i.e.,
to spot misestimations). A lot of the iterators have no proper cost evaluation; 
e.g. sorts are essentially taken to be free. However, table 
iterators (table scan, ref access, etc.) have cost and row estimates, 
conditions have filtering effect estimates (which we can multiply by the 
row estimate to get a new row estimate; in the DBT-3 example below, the 
150,000-row estimate for the customer table scan combined with the condition's 
estimated 10% filtering effect gives the rows=15000 estimate on the filter 
node), and our joins have cost and row estimates for the entire join. This 
won't be complete, and it won't be perfect (e.g., since the conditions are 
split into multiple FilterIterators with pre- and post-join filters), but it 
will generally be useful.

Note that costs are in a different unit (arbitrary optimizer cost units) from 
the measurements (wall clock milliseconds); there is no direct way to convert 
between the two. However, the relative values will still be useful for 
comparison.

A real, fairly complex example from DBT-3 looks like this (hoping that the WL 
system won't break the lines):

 -> Nested loop inner join (cost=264528.90 rows=112505) (actual 
time=3.029..3765.266 rows=144544 loops=1)
     -> Filter: (customer.c_mktsegment = 'AUTOMOBILE') (cost=155190.00 
rows=15000) (actual time=0.905..341.270 rows=29752 loops=1)
         -> Table scan on customer (cost=155190.00 rows=150000) (actual 
time=0.893..289.690 rows=150000 loops=1)
     -> Filter: (orders.o_orderDATE < DATE'1995-03-14') (cost=30.00 rows=8) 
(actual time=0.096..0.114 rows=5 loops=29752)
         -> Index lookup on orders using i_o_custkey 
(o_custkey=customer.c_custkey) (cost=30.00 rows=15) (actual time=0.096..0.112 

Here we can see that the nested loop takes about 3.7 seconds to run, and 
outputs 144k rows, not too far off from the estimated 112k (even though 
some of the sub-iterators are a bit off). Interestingly, scanning the 
"customer" table takes only 341 ms, i.e. about 10% of the time spent in the 
join. What really costs is the index lookups, at roughly 0.1 ms per lookup 
(0.112 ms × 29752 loops ≈ 3.3 s of the join's 3.77 s); this demonstrates that 
secondary index lookups in InnoDB are expensive, since they need 
to go through the primary key. For comparison, here's another part from the 
same query:

  -> Index lookup on lineitem using PRIMARY (l_orderkey=orders.o_orderkey) 
(cost=2.02 rows=4) (actual time=0.051..0.053 rows=4 loops=144544)

So the relative cost per-row is roughly 2x.

Longer-term, we'll need to work with Workbench to provide a JSON version of the 
same information, for graphical consumption.

JSON format (future, not covered by this worklog)

The proposed JSON format (for use by MySQL Workbench) closely mirrors
the textual tree format. Consider this EXPLAIN ANALYZE:

-> Nested loop inner join  (cost=1.45 rows=3) (actual time=0.006..0.006 rows=0 
    -> Filter: (a.b = 500)  (cost=1.35 rows=1) (actual time=0.005..0.005 rows=0 
        -> Table scan on a  (cost=1.35 rows=3) (actual time=0.004..0.005 rows=3 
    -> Table scan on b  (cost=0.25 rows=3) (never executed)

This would translate into:

{
  "explain_analyze": {
    "text": "Nested loop inner join",
    "estimated_cost": 1.45,
    "estimated_rows": 3,
    "actual_time_first_row_ms": 0.006,
    "actual_time_all_rows_ms": 0.006,
    "actual_rows": 0,
    "actual_loops": 1,
    "children": [
      {
        "text": "Filter: (a.b = 500)",
        "estimated_cost": 1.35,
        "estimated_rows": 3,
        "actual_first_row_ms": 0.005,
        "actual_all_rows_ms": 0.005,
        "actual_rows": 0,
        "actual_loops": 0,
        "children": [
          /* even more sub-children here */
        ]
      },
      {
        "text": "Table scan on b",
        "estimated_cost": 0.25,
        "estimated_rows": 3,
        "actual_first_row_ms": null,
        "actual_all_rows_ms": null,
        "actual_rows": null,
        "actual_loops": null
        /* no "children" */
      }
    ]
  }
}