WL#3985: Subquery optimization: smart choice between semi-join and materialization

Affects: Server-6.0   —   Status: Complete

At the moment, subqueries that can be handled by both semi-join and
materialization are always handled using some semi-join strategy. The goal of
this WL entry is to do some cost or heuristics-based choice between semi-join
and materialization.

1. Task setting
1.1. Considerations about when each strategy is better
2. Fitting materialization into join optimization and execution
2.1 Changes in the join executor
2.1.1 Plan refinement step changes
2.2 Changes in the join optimizer
2.2.1 Cost of materialization
2.2.2 Fitting materialization into the join optimizer
2.2.3 Materialization and join buffering
3. User interface
3.1 EXPLAIN output changes
3.2 Control from SQL level
3.2.1 @@optimizer_switch
3.2.1 Subquery hints
4. Extras and Todos
4.1 Correlation wrt constant tables
4.2 Igor's idea: check for subquery "late"

At the moment, subqueries that can be handled by both semi-join and 
materialization are always handled using some semi-join strategy. The goal
of this WL entry is to do some cost or heuristics-based choice between
semi-join and materialization.

Semi-join vs materialization

1. Task setting
We consider subqueries that can be handled by both semi-join and
materialization strategies. This means that:

 * the subquery is an IN-subquery
 * the subquery a [top-level] AND-part of the WHERE/ON clause
    - it satisifes all semi-join's requirements, i.e. it is a single
      select w/o grouping, aggregates, etc.
 * The subquery's parent join has enough join bits so that it is possible 
   to convert subquery into semi-join.
 * the subquery is uncorrelated (this is required by materialization)

At the moment, semi-join is always chosen over materialization. However, there
are cases where materialization is faster than semi-join. The goal of this WL
is to develop a code to recognize such cases.

1.1. Considerations about when each strategy is better

Semi-join is better when the subquery is 'strongly correlated', and there are
indexes which will allow us to scan a small fraction of total subquery's
record combinations.

If that is not the case, semi-join execution will cause repeated enumeration
of the same big set of record combinations, and in that case it will pay off 
if we do materialization: run the subquery once, create the lookup index, a
and do index lookups instead of re-running the subquery.

psergey-todo: do these words matter anymore:
(thesis about "it makes no sense to do interleaving with uncorrelated subquery"
  and IN-equality based counterexample)

All of the above should be automatically taken into account by cost-based
materialization-aware join optimizer.

2. Fitting materialization into join optimization and execution

The idea is to delay the choice until the join optimizer. This can be done as

* Convert the subquery into semi-join nest, mark the semi-join nest as 
  "suitable for materialization" (note: according to Monty, the limit on
  number of bits in table_map should not be an issue. We'll be able to switch
  to bigger table_map if the need arises. But so far we haven't seen any 
  demand for this)

* Add materialization as another way to execute suitable semi-join nests.
  Actually, this means we're introducing bushy join execution plans.

2.1 Changes in the join executor

Subquery materialization will be handled in a manner similar to join
buffering. Subquery tables will occupy a continuous range in the join order,
i.e. the join order will have form:

  ot1 ... otK (( it1  it2 ... itN )) ot{k+1} ...

When the NL-join execution reaches table it1 for the first time, regular join
execution will be interrupted to perform materialization.

  The procedure of materialization is to enumerate all (it1, ..., itN) record
  combinations and put them into a temporary table.

After that we'll make a lookup in the temporary table which will show whether
the execution should continue to ot{k+1} or return to otK.

For all subsequent cases, whenever NL-join execution reaches table it1,
NL-join execution in the ((it1; itN)) range will be buypassed and replaced
with a single lookup in the materialized temporary table.

2.1.1 Plan refinement step changes
In order to prepare for the execution step, we'll need perform these actions
at plan refinement step:

* Create the materialization temporary table
* Set up JOIN_TAB::next_select functions to do subquery materializaion and 
  index lookup.

2.2 Changes in the join optimizer

2.2.1 Cost of materialization
Without materialization, cost and fanout of a partial join order are calculated
using these recursive formulas (taken from best_extension_by_limited_search()):

    current_record_count= record_count * join->positions[idx].records_read;
    current_read_time=    read_time + join->positions[idx].read_time;

    read_time= record_count * cost(table_access)

    or, when the table uses join buffering,

    read_time= number_of_times_join_buffer_is_refilled * cost(table_access)

Materialization is different, for it we have:

    current_record_count= record_count; // fanout==1 always

    mat_read_time= materialize_cost + record_count * lookup_cost

materialize_cost is a cost to materialize subquery's result:

    materialize_cost= subquery_run_cost +  // cost to execute the subquery
                      storage_cost         // cost to populate the temp. table

subquery_run_cost is calculated as cost of running a join of subquery tables.

storage_cost is the cost of populating the temporary table. It should depend
 * number of records we expect to have in the table
 * record size
 * whether the table is expected to be HEAP or MyISAM table. The table will be
   MyISAM if it is larger than max. heap table size.

The formula should be rougly as follows:

  storage_cost= (record_size * #records < max_heap_table_size)?

2.2.2 Fitting materialization into the join optimizer
We need values of outer columns before we can make a lookup. This means that
subquery tables must be located after all of the outer correlated tables
(psergey-todo: put a note re how this works with equality propagation)

  Except for the strategy where we scan the materialized table instead of
  making lookups. In that case subquery tables may be located before the outer

Within our current cost model, a lookup in the temporary table that gives at
most one match is the best possible access method (its cost is 1.0, regardless
of the table size).  We also always have fanout >= 1.0, which means that we
ignore the fact the join might have no outputs and we will never have to
materialize the subquery. As a consequence, we have this heuristic:

  * Materialization tables should be put right after the last outer correlated
    table (psergey-todo: this rule may become more complicated if we take into
    account equality propagation).

psergey-todo: detailed look at Materialization vs FirstMatch costings.
psergey-todo: Take testcase BUG#28257 and BUG#11254 and check whether the
              fastest strategy will be chosen with the formulas we have here.

Let's also note that if there is a join order that can be executed using
materialization, that join order can be also executed with first match
strategy, and the choice between the two depends on the available access
methods and costs.

This hints that we should use the "fork" apporach:

  bool considered_materialization= (semi_join_map)0;
  for (JOIN_TAB **pos= join->best_ref + idx ; (s= *pos) ; pos++)
    considering_materialization_now= FALSE;
      At the location where we would call best_access_path():
    if (next table belongs to materialization-suitable semi-join nest SJM &&
        !(considered_materialization.bit_set(SJM.sj_bit) &&
        no tables of SJM are in the join order yet && 
        all outer tables used in SJM's IN-equality are in the join
          order prefix)

      // Consider executing SJM using materialization:
      put all subquery's tables into the join order prefix;
      add subquery cost; // this is instead of best_access_path
        Save this so that when we have a join prefix of (ot1,ot2)
        and possible extensions of (it1, it2), we don't consider
        materialization two times for
         ot1 o2 materialize(it1, it2), and
         ot1 o2 materialize(it2, it1)
      // This is to restore things correctly at the end of the loop iteration
      considering_materialization_now= TRUE;

    Here goes the heuristic pruning code
    and the recursive best_extension_by_limited_search call

    if (considering_materialization_now)
      remove subquery's tables from the join order prefix;
      // Now consider the same extension but w/o materialization
      considering_materialization_now= FALSE;
      goto loop_start;
      // Regular loop iteration end
      restore_prev_sj_state(remaining_tables, s);

psergey-todo: What if the number of tables in the subquery exceeds the depth 
  of the heuristics? Check if everything is ok in that case: do comparisons of
  of partial plans like

    join_prefix(prefix_tables, materialization(#subquery_tables))
    join_prefix(prefix_tables, number-less-than-subquery_tables)
  make sense? Or we should adjust for different numbers of tables?

In general case the join optimizer will consider many join orders that use
materialization, so it makes sense to not re-calculate the optimal way to
execute a subquery every time. We will pre-calculate subquery's cost and
output cardinality before starting the main join optimization run.

The pre-calculation step is just an invocation of the join optimizer for the
contents of the subquery's semi-join nest. (we might need to make certain
adjustments if the subquery's semi-join nest is contained within some outer
join nest. That is to be covered in the LLD)

2.2.3 Materialization and join buffering
It is possible to use join buffering for subquery tables, except for the first
one. However, we must take into account that the join buffer should contain
only record combinations of the subquery tables (and not record combinations
of all preceding tables).

TODO ^ the above is possible only because subquery is run only once, right?

Materialization strategy doesn't prevent use of join buffering for the outer
tables. There is however a difference that outer tables' join buffers must 
not contain subquery's tables rows, as they is not needed and will only 
waste join buffer space.

3. User interface

3.1 EXPLAIN output changes

Let the subquery's tables have select_type='MATERIALIZE'. This doesn't make 
EXPLAIN output any wider. (TODO: check if the above guarantees unambiguous 
interpretation of EXPLAIN output when there are several nested subqueries)

3.2 Control from SQL level

3.2.1 @@optimizer_switch
Join optimizer will not consider materialization if @@optimizer_switch has 
'no_materialization' flag set.

If @@Optimizer_switch includes 'no_semijoin' flag, subquery will not be
converted into a semi-join nest at all. 
(NOTE: the other option would be to convert the subquery into semi-join nest
but then force a join order that uses semi-join materialization strategy.

ATM there is no meaningful difference between the two but in the future there
might be.)

3.2.1 Subquery hints
Subquery may have:
  SELECT [ /* MATERIALIZE */ | /* NO_MATERIALIZE */ | /* FLATTEN */ ] ... 

or something like that. This is low priority.

4. Extras and Todos

4.1 Correlation wrt constant tables
Suppose a subquery satisfies the semi-join criteria but doesn't meet
materialization's requirements because it is correlated. If we later find
that it is correlated only wrt constant tables, then the subquery can be
considered uncorrelated and handled with materalization.

Should we bother detecting/handling this case? (will need to check how
expensive this is when we get the LLD draft)

4.2 Igor's idea: check for subquery "late"

The idea is that the check whether we should do materialization should be done
"late", i.e. it should have the following form:

in best_extension_by_limited_search(), right after best_access_path() call:
if all of the subquery tables form a suffix in the current join order prefix,
consider also materialization.

TODO: relative merits of both approaches?

Late materialization detection
 - doesn't that try the same plan too many times? that is, we'll consider
   materialization for every permutation of subquery's tables.
   (and we cant' try materialization only for some particular permutation as
   it might occur that that permutation is not considered because of heuristic
 + less query plans to consider.

At the moment SergeyP doesn't think late detection is worth it.
The following is a structured todo list. Each item is a milestone, implementation 
is to be performed in the order items are listed 

M1. Materialization detector
Let subquery-predicate -> semi-join nest conversion code mark the semi-join
nests that can be handled by materialization, or create an appropriate 

  bool TABLE_LIST::is_materializable_sj_nest()

function (The criteria are easy - just check whether the original subquery
predicate was uncorrelated).

M2. Make the join optimizer support materialization

M2.1 Choose join sub-order for materialization join nests

M2.2 Consider/choose materialization in join optimization
Make the decision between early and late materialization join order detection
(at the moment leaning towards early detection) and implement it in the join
optimizer. The result should be that we produce a marked join order but it is
still executed with duplicate elimination.

M3. Make join executor support materialization
Create a Next_select_func-compatible function that will implement the
materialization strategy.

M3.1 Support for EXPLAIN
Let EXPLAIN show the materialization strategy. This is to be done together
with M3.

M4. Support sequential scans of materialized table

This is the last milestone where we will add support for 

 * Performing the materialization step before we've read the outer tables
 * Sequentially scanning the temporary table.

M4A. Insideout order for sj-materialization

(this part was discovered and added later in development. To be done as the last

Make the at_sjm_pos() check also check if we could use inside-out-sjm. The
criteria are the same except that we don't require the outer-correlated tables
to be in the join prefix.

The cost of inside-out-sjm is composed of:

* The cost to populate the temp. table, once per join execution
* The cost to do a full temp. table scan. Since this is a full scan, it can 
  be used together with join buffering.

Note: for tuple-based subqueries we could also do index prefix lookups in the
materialization table, i.e. we could have a subquery

  (tbl1.a, tbl2) IN ( SELECT ie1, ie2 FROM inner_table WHERE),

a join order of 

   tbl1, SJM(inner_table), tbl2 

and have SJM do lookups using materialization_temptable.keypart1=tbl1.a. We
will not implement this.


Execution stage will:

 * Do materialization in the way SJ-Materialization code does it
 * Then use full scans for subquery table. The said scan may be done using
   join buffering.

 * same sjm-materialize function, but using a flag which will tell whether to
   do an index scan or full scan.
   NO. Not exactly same as this may be using join buffering.
   Q: Why do that? Do we really need join buffering here? 
   AA: ok may disable the join buffering for the 1st implementation.

 * If a subsequent table will use join buffering, its join_init_cache() call 
   must take sjm-scan into account and put the materialized column into join
   Q: what if the materialized column is not base table column but a function?
   A: disable this case for now.

InsideOut orders for non-semi-join subqueries

Consider this case: a subquery is an AND-part of the WHERE but can't be 
executed by semi-join runtime because it has GROUP BY/ORDER BY .. LIMIT/etc:


It is possible to use a variant of insideout join order for such queries, and it
can be a great advantage to do so.

It is not possible to use semi-join materialization code for such subqueries
though, so we'll use special handling:

Optimizer part
The subquery will be represented by a JOIN_TAB element of a special kind. The 
element will allow two kinds of accesses:

* lookup (with cost and all characteristics of an access method)
* full scan (this is what we're mainly after)

Join optimizer will be able to pick position for the join tab so the cost is 
minimized. This will map to the execution strategies described further.

Note: don't do all this to subqueries that are within outer joins (is that 

Executioner part

* lookup. Done in a similar way to semi-join lookup.
* full scan: we'll need to use subselect_hash_engine code to produce the 
  temporary table and then we'll be able to scan it as usual.

Other possible subquery materialization gotchas

* insideout + range access MUST use HA_MRR_SORTED. Unordered DS-MRR or cluster
  scan will screw up everything.

* range+insideout. Why chosen so rarely?? 

* Timour's subqueries must show "MATERIALIZE" to be consistent with our 
  EXPLAIN lines.

WL#3985 Additional specification

This document describes the differences between the original WL#3985 spec
and the committed code.

1. Join optimizer changes
1.1. First Match optimization
1.1.1 Problems when using FirstMatch and late detection
1.1.2 Solution: forks
1.3 SJ-Materialization
1.4 Sj-Materialization scan
1.5 DuplicateWeedout

1. Join optimizer changes

In order to make a cost-based choice between semi-join and materialization,
we need to calculate the costs and fanout of every subquery semi-join strategy.

1.1. First Match optimization
FirstMatch strategy can be used to resolve one or several interleaved
semi-join nests, and its applicability criteria are:

1. The join order has form 

    (ot|nt)+  (it|nt)+   ot*
             ^         ^
            (1)       (2)

2. Tables between points (1) and (2) do not use join buffering

There is no extra cost in using FirstMatch strategy. The fanout produced by
sj-inner tables is removed at point (2).

1.1.1 Problems when using FirstMatch and late detection
The requirement not to use join buffering presents a problem - use of join
buffering affects the access costs and optimal order for tables within the 
(1)...(2) range, so if we implement "late detection" we can get into scenarios
like this:

* We run the join optimizer and allow to use join buffering
* At some point we figure we're at location (2) // and where the
                                                // corresponding point (1) is
* We want to have correct join cost value, so we go back to (1)
  and re-run the join optimizer with join buffering disabled.

  The problem is that if we diable join buffering we may figure that there 
  exist tables in the (1)-(2) join suborder that should not have been put 

  In other words: if we denote
    best_extension(join_prefix, N) =  {ta1, ... taN}
    best_extension(join_prefix, N, disable_join_buffer=TRUE)= {tb1, ... tbN}

    list_compare({ta1, ... taN}, {tb1, ... tbN}) != 0

  and even
    set_compare({ta1, ... taN}, {tb1, ... tbN}) != 0

This makes it very difficult to use "late detection" for FirstMatch and have
correct cost value and not introduce limitations that could cause best join
order to be never considered.

1.1.2 Solution: forks
Solution to the above problem is to use early detection and a "fork":

 * whenever we discover we are at position (1), create a separate branch in
   join order enumeration.  The branch will be used to check the FirstMatch 
   strategy. Within the branch

    - join order extension is limited so that FirstMatch strategy's 
      requirements are met (don't add inner tables which do not have all of 
      their outer tables in the prefix before point (1)).
    - join buffering is disabled until point (2) is reached.

Introduction of branches increases the number of different combinations
enumerated by the join optimizer. Each semi-join nest causes an increase
equivalent to addition of one more join table.

1.2 LooseScan optimization
LooseScan is similar to FirstMatch: 
 - it doesn't allow to use join buffering within it's range
 - it places limits on what tables are allowed in its duplicate-generating
 - In addition, it uses a special access method for the 1st table.

All of the above can be addressed by forking, so we use forks approach for
this case, too. 

FirstMatch and LooseScan's applicability criteria are mutually exclusive so
we don't ever get into two forks at once.

1.3 SJ-Materialization
SJ-Materialization is handled with pre-optimization and "late detection".

Pre-optimization means that we compute the optimal sub-join order for
materialization before the main join optimization start.

Late detection works as described in the main WL#3985 spec:

  ot1 ... otN (it1 .. itN)
             ^            ^ 
            (1)          (2)

Once we have detected we are at position (2), we adjust the join prefix output
cardinality and cost values accordingly.

Since SJ-Materialization ignores the results of join optimizer's work in the 
(1)-(2) range, we don't need any join optimizer forks.

1.4 Sj-Materialization scan
SJM-scan has additional complication with fanout. Consider a join order with

  SJM-SCAN(it1 ..  itN) ot1 ... otN nt1 .. 
                       ^           ^
                      (1)         (2)

At position (1) join prefix output cardinality is a product of fanouts of
tables it1 ... itN. In general case, it is greater than 1.  

On the other hand, at point (2), join prefix cardinality is a product of
outer tables only.

This can get even more complicated if there are two SJM scans. Consider the
query and the join order
  SELECT ...
  FROM  otA1, otB1, otB1, otB2 
    expr(otA1, otA2) IN (SELECT ... FROM itA1 .. itAN) AND
    expr(otB1, otB2) IN (SELECT ... FROM itB1 .. itBK) 

  SJM-SCAN(itA1 ..  itAN)  otA1  SJM-SCAN(itB1 ..  itBK)  otB1 otA2 otB1
                                                              ^         ^
                                                             (1)       (2)

Here the fanout introduced by (itA1 .. itAN) is removed at point (1) and 
fanout introduced by (itB1 .. itBK)

1.5 DuplicateWeedout
DuplicateWeedout does not create additional forks. It is detected late and it
has a special property that it is the default catch-all method.

   ... ot1 it1 ....    itN
   x======================x -- duplicate generation area
                            \_ consider duplicate weedout when we get here

The point when we consider DuplicateWeedout is the latest point where we
consider any semi-join optimization strategy.  This is useful - if we see that 
we have arrived at point (2) and there are duplicates that are not handled by
any semi-join strategy, we unconditionally use duplicate weedout (and add its
cost).  Otherwise, we compare the cost of duplicate weedout with costs of
other strategies.